Similar literature
 Found 20 similar documents; search time 46 ms
1.
This paper presents a set of platform-independent architectural optimizations for improving the performance of software-based standard video coders. These are changes to the underlying memory model and physical architecture of the encoder, made with the objective of maximizing encoder performance with respect to execution time and memory usage. Coding quality does not suffer from these modifications because the algorithm itself is not changed. An interface-driven methodology has been developed to identify and ameliorate performance bottlenecks for any encoder. Appropriate data flow between components is proposed so that memory-intensive operations, including memory accesses and copying, are minimized. The proposed methods have been applied to an MPEG4 reference implementation to demonstrate the computational improvements achieved while avoiding any algorithmic modifications. These techniques are shown to improve the overall encoding time of a video codec on a general-purpose computing platform by 15–50%. The resulting implementation is also shown to be faster than some well-known open source solutions.

2.
We present a demand-driven approach to memory leak detection based on flow- and context-sensitive pointer analysis. The detection algorithm first assumes the presence of a memory leak at some program point and then runs a backward analysis to see whether this assumption can be disproved. Our algorithm computes the memory abstraction of programs based on the points-to graph resulting from flow- and context-sensitive pointer analysis. We have implemented the algorithm in the SUIF2 compiler infrastructure and used the implementation to analyze a set of C benchmark programs. The experimental results show that the approach has better precision with satisfactory scalability, as expected. This work is supported by the National Natural Science Foundation of China under Grant Nos. 60725206, 60673118, and 90612009, the National High-Tech Research and Development 863 Program of China under Grant No. 2006AA01Z429, the National Basic Research 973 Program of China under Grant No. 2005CB321802, the Program for New Century Excellent Talents in University under Grant No. NCET-04-0996, and the Hunan Natural Science Foundation under Grant No. 07JJ1011.

3.
In service-oriented architecture (SOA), service composition is a promising way to create new services. However, some technical challenges are hindering the application of service composition. One of the greatest challenges for a composite service provider is to select a set of services to instantiate a composite service with end-to-end quality of service (QoS) assurance across different autonomous networks and business regions. This paper presents an iterative service selection algorithm for quality-driven service composition. The algorithm runs on a peer-to-peer (P2P) service execution environment, distributed intelligent service execution (DISE), which provides a scalable QoS registry, dynamic service selection, and service execution services. The most significant feature of our iterative service selection algorithm is that it can work on a centralized QoS registry as well as across decentralized ones. Network status is an optional factor in our QoS model and selection algorithm. The algorithm iteratively selects services following the service execution order, so it can be applied either before service execution or at service run-time without any modification. We test our algorithm with a series of experiments on DISE. Experimental results illustrate its excellent selection and outstanding performance.

4.
Parallel Error Detection for Leading Zero Anticipation   Total citations: 1, self-citations: 0, citations by others: 1
The leading zero anticipation (LZA) algorithm and its implementation are vital to the performance of a high-speed floating-point adder in today's state-of-the-art microprocessor design. Unfortunately, when a conventional LZA design predicts the "shift amount", the result can be off by one position. This paper presents a novel parallel error detection algorithm for a general-case LZA. The proposed approach enables parallel execution of the conventional LZA and its error detection, so that the error-indication signal can be generated earlier in the normalization stage, thus reducing the critical path and improving overall performance. The circuit implementation of this algorithm also shows advantages in area and power compared with previous work.

5.
Microarchitecture of the Godson-2 Processor   Total citations: 23, self-citations: 3, citations by others: 23
The Godson project is the first attempt to design high-performance general-purpose microprocessors in China. This paper introduces the microarchitecture of the Godson-2 processor, a 64-bit, 4-issue, out-of-order execution RISC processor that implements a 64-bit MIPS-like instruction set. The adoption of aggressive out-of-order execution techniques (such as register mapping, branch prediction, and dynamic scheduling) and cache techniques (such as non-blocking cache, load speculation, and dynamic memory disambiguation) helps the Godson-2 processor achieve high performance even at a not-so-high frequency. The Godson-2 processor has been physically implemented in a 6-metal 0.18μm CMOS technology based on an automatic placing and routing flow with the help of some crafted library cells and macros. The area of the chip is 6,700 micrometers by 6,200 micrometers and the clock cycle at the typical corner is 2.3ns.

6.
This paper presents an intermediate program representation called the Hierarchical Task Graph (HTG), and argues that it is not only suitable as the basis for program optimization and code generation, but also fully encapsulates program parallelism at all levels of granularity. As such, the HTG can be used as the basis for a variety of restructuring and optimization techniques, and hence as the target for front-end compilers as well as the input to source and code generators. Our implementation and testing of the HTG in the Parafrase-2 compiler has demonstrated its suitability and versatility as a potentially universal intermediate representation. In addition to encapsulating semantic information and data and control dependences, the HTG provides more information vital to efficient code generation and to optimizations related to parallel code generation. In particular, we introduce the notion of precedence between nodes of the structure, whose grain size can range from atomic operations to entire subprograms. This work was supported in part by the National Science Foundation under Grant No. NSF-CCR-89-57310, the U. S. Department of Energy under Grant No. DOE-DE-FG02-85ER25001, and a grant from Texas Instruments Inc.

7.
Multiple-Morphs Adaptive Stream Architecture   Total citations: 2, self-citations: 0, citations by others: 2
In modern VLSI technology, hundreds of thousands of arithmetic units fit on a 1 cm^2 chip. The challenge is supplying them with instructions and data. Stream architecture addresses this problem well. However, the applications suited to a typical stream architecture are limited. This paper defines regular and irregular streams, and then describes the MASA (Multiple-morphs Adaptive Stream Architecture) prototype system, which supports different execution models according to the stream characteristics of applications. The paper first discusses the MASA architecture and stream model, and then explores the features and advantages of MASA by mapping stream applications to hardware. Finally, MASA is evaluated with ten benchmarks. The results are encouraging.

8.
The adjoint code generator (ADG) is developed to produce adjoint codes, which are used to analytically calculate gradients and Hessian-vector products at costs independent of the number of independent variables. Unlike other automatic differentiation tools, the implementation of ADG has the advantage of using the least program behavior decomposition method and several static dependence analysis techniques. In this paper we first address the relevant concepts and fundamentals, and then introduce the functionality and features of ADG. In particular, we discuss the design architecture of ADG and implementation details, including the recomputation and storing strategy and several techniques for code optimization. Some experimental results from several applications are presented at the end. Supported by the National Natural Science Foundation of China (Grant Nos. 60503031, 10871014), and the National Basic Research Program of China (Grant No. 2004CB418304)

9.
This paper investigates how to efficiently maintain a dynamic ordered set of bit strings, which is an important problem in the field of information search and information processing. Generally, a dynamic ordered set is required to support 5 essential operations: search, insertion, deletion, max-value retrieval, and next-larger-value retrieval. Building on previous research results, we present an advanced data structure named rich binary tree (RBT), which satisfies both the binary-search-tree property and the digital-search-tree property. In addition, every key K keeps the most significant difference bit (MSDB) between itself and the next larger value among K's ancestors, as well as that between itself and the next smaller one among its ancestors. With the new data structure, we can maintain a dynamic ordered set in O(L) time. Since computers represent objects in binary form, our method has great potential in applications. In fact, RBT can be viewed as a general-purpose data structure for problems concerning order, such as search, sorting, and maintaining a priority queue. For example, when RBT is applied to sorting, we get an algorithm that is linear in the number of keys, and its performance is far better than quick-sort. Beyond what quick-sort offers, RBT also supports constant-time dynamic insertion/deletion. Supported by the National Natural Science Foundation of China (Grant No. 60873111), and the National Basic Research Program of China (Grant No. 2004CB719400)
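The MSDB mentioned in this abstract can be illustrated with a short sketch. It assumes keys are non-negative integers and counts bit positions from the least significant bit (position 0); both conventions, and the function name `msdb`, are assumptions for illustration and are not taken from the paper.

```python
def msdb(a: int, b: int) -> int:
    """Return the position of the most significant bit in which a and b differ.

    Positions are counted from the least significant bit (0).
    Returns -1 when the two keys are equal (no differing bit).
    """
    # XOR sets exactly the bits where a and b differ; the highest set bit
    # of the XOR is therefore the most significant difference bit.
    return (a ^ b).bit_length() - 1


def smaller_by_msdb(a: int, b: int) -> int:
    """Pick the smaller key by inspecting only the MSDB (illustrative helper)."""
    pos = msdb(a, b)
    if pos < 0:
        return a  # equal keys
    # At the MSDB, the smaller key has a 0 bit and the larger key has a 1 bit.
    return a if (a >> pos) & 1 == 0 else b
```

For example, `msdb(0b1010, 0b1001)` is 1, because the two keys agree on bits 3 and 2 and first differ at bit 1. The second helper shows why this bit is useful for ordered sets: the order of two keys is decided entirely by their values at that one position.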

10.
The execution of composite Web services with WS-BPEL relies on external autonomous Web services. This implies the need to constantly monitor the running behavior of the involved parties. Moreover, monitoring the execution of composite Web services for particular patterns is critical to enhancing the reliability of the processes. In this paper, we propose an aspect-oriented framework as a solution to provide monitoring and recovery support for composite Web services. In particular, this framework includes 1) a stateful aspect-based template, where a history-based pointcut specifies patterns of interest that cannot be violated within a range, while advice specifies the associated recovery action; 2) tool support for runtime monitoring and recovery based on an aspect-oriented execution environment. Our experiments indicate that the proposed monitoring approach incurs minimal overhead and is efficient. This work is supported by the National Natural Science Foundation of China under Grant Nos. 60673112, 90718033, the National Basic Research 973 Program of China under Grant No. 2009CB320704, and the High-Tech Research and Development 863 Program of China under Grant Nos. 2006AA01Z19B, 2007AA010301.

11.
Important insights into program operation can be gained by observing dynamic execution behavior. Unfortunately, many high-performance machines provide execution profile summaries as the only tool for performance investigation. We have developed a tracing library for the CRAY X-MP and CRAY-2 supercomputers that supports the low-overhead capture of execution events for sequential and multitasked programs. This library has been extended to use the automatic instrumentation facilities on these machines, allowing trace data from routine entry and exit, and other program segments, to be captured. To assess the utility of the trace-based tools, three of the Perfect Benchmark codes have been tested in scalar and vector modes with the tracing instrumentation. In addition to computing summary execution statistics from the traces, interesting execution dynamics appear when studying the trace histories. It is also possible to model application performance based on properties identified from traces. Our conclusion is that adding tracing support in Cray supercomputers can have significant returns in improved performance characterization and evaluation. An earlier version of this paper was presented at Supercomputing '90. Supported in part by the National Science Foundation under Grants No. NSF MIP-88-07775 and No. NSF ASC-84-04556, and the NASA Ames Research Center Grant No. NCC-2-559. Supported in part by the National Science Foundation under grant NSF ASC-84-04556. Supported in part by the National Science Foundation under grants NSF CCR-86-57696, NSF CCR-87-06653 and NSF CDA-87-22836 and by the National Aeronautics and Space Administration under NASA Contract Number NAG-1-613.

12.
Disaster recovery (DR) techniques ensure data safety and service continuity under different natural and human-made disasters by constructing a highly reliable storage system. Traditional disaster recovery methods are structure-dependent. It is hard to share DR resources between different DR systems, which makes them expensive. We present a structure-independent disaster recovery theory and its implementation methods in this paper. By backing up the whole system rather than just the data, the goal of device- and application-independent disaster recovery is achieved. We further present a parallel recovery model and an on-demand data retrieval method based on the theory. Some implementation details of the prototype recovery system are also discussed. Because the methods are independent of specific devices and applications, the cost of disaster recovery infrastructure can be substantially reduced by resource sharing. Experiments show that the recovery time is also greatly shortened with little service degradation. Supported by the National Basic Research Program of China (Grant No. 2007CB311100)

13.
Software testing is an important technique to assure the quality of software systems, especially high-confidence systems. To automate the process of software testing, many automatic test-data generation techniques have been proposed. To generate effective test data, we propose a test-data generation technique guided by static defect detection in this paper. Using static defect detection analysis, our approach first identifies a set of suspicious statements which are likely to contain faults, then generates t...

14.
This paper presents an interactive graphics processing unit (GPU)-based relighting system in which the local lighting condition, surface materials, and viewing direction can all be changed on the fly. To support these changes, we simulate the light transport process at run time, which is normally impractical for interactive use due to its huge computational burden. We greatly alleviate this burden with a hierarchical structure named a transportation tree that clusters similar emitting samples together within a perceptually acceptable error bound. Furthermore, by exploiting coherence in time as well as in space, we incrementally adjust the clusters rather than computing them from scratch in each frame. With a pre-computed visibility map, we are able to efficiently estimate the indirect illumination in parallel on graphics hardware by simply summing up the radiance shoots from cluster representatives, plus a small number of merging and splitting operations on clusters. With relighting based on the time-varying clusters, interactive update of global illumination effects with multi-bounced indirect lighting is demonstrated in applications to material animation and scene decoration. Supported by the National Basic Research Program of China (Grant No. 2009CB320802), the National Natural Science Foundation of China (Grant No. 60833007), the National High-Tech Research & Development Program of China (Grant No. 2008AA01Z301), and the Research Grant of the University of Macau

15.
Summary. We present a mathematically rigorous and, at the same time, convenient method for systolic design and derive systolic designs for three matrix computation problems. Each design is synthesized from a simple program and a proposed layout of processors. The synthesis derives a systolic parallel execution, channel connections for the proposed processor layout, and an arrangement of data streams such that the systolic execution can begin. Our choices of designs are governed by formal theorems. The synthesis method is implementable and is particularly effective if implemented with graphics capability. Our implementation on the Symbolics 3600 displays the resulting designs and simulated executions graphically on the screen. The method's centerpiece, a transformation of sequential program computations into systolic parallel ones, has been mechanically proved correct. Parts of this work have been presented at the Conference on Parallel Architectures and Languages Europe (PARLE) [10]. This research has been supported in part by Grant No. 26-7603-35 from the Lockheed Missiles & Space Corporation and by Grant No. DCR-8610427 from the National Science Foundation

16.
Certifying Concurrent Programs Using Transactional Memory   Total citations: 1, self-citations: 0, citations by others: 1
Transactional memory (TM) is a promising new concurrency-control mechanism that can avoid many of the pitfalls of traditional lock-based techniques. TM systems handle data races between threads automatically so that programmers do not have to reason about the interaction of threads manually. TM provides a programming model that may make the development of multi-threaded programs easier. Much work has been done to explore the various implementation strategies of TM systems and to achieve better perfor...

17.
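Applying loop pipelining to coarse-grained reconfigurable architectures can bring significant performance gains. Loop control, pipeline synchronization, and effective use of memory are the key issues. This paper describes a hardware implementation of autonomous loop pipelining on the coarse-grained reconfigurable architecture LEAP. The approach is based on control units that support automatic scheduling of loop iterations, data-driven ALUs, and configurable static switching routers. By dynamically scheduling the operations in a loop, LEAP can exploit a higher degree of program parallelism; distributed memory access and efficient data reuse improve bandwidth utilization. Experimental results show that LEAP achieves a speedup of 13.08 to 535.65 times over a general-purpose processor.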

18.
Passive radar is a current research focus. The implementation of the Chinese standard for digital television terrestrial broadcasting (DTTB) creates a new opportunity for passive radar. The DTTB system has single-carrier and multicarrier application modes. In this paper, the ambiguity functions of DTTB signals in the single-carrier and multicarrier application modes are analyzed. The ambiguity function of the DTTB signal contains one main peak and many side peaks. The relative positions and amplitudes of the side peaks are derived, and the reasons for their occurrence are identified. The side peaks identification (SPI) algorithm is proposed to avoid the false alarms caused by the side peaks. Experimental results show that the SPI algorithm can identify all the side peaks without power loss. This research provides the foundation for designing DTTB-based passive radar. Supported by the National Natural Science Foundation of China (Grant No. 60232010), the Ministerial Foundation of China (Grant No. A2220060039) and the National Natural Science Foundation of China for Distinguished Young Scholars (Grant No. 60625104)

19.
The use of dynamic dependence analysis spans several areas of software research including software testing, debugging, fault localization, and security. Many of the techniques devised in these areas require the execution of large test suites in order to generate profiles that capture the dependences that occurred between given types of program elements. When the aim is to capture direct and indirect dependences between fine-grained elements, such as statements and variables, this process becomes highly costly due to: (1) the large number of elements, and (2) the transitive nature of the indirect dependence relationship. The focus of this paper is on computing dynamic dependences between variables, i.e., dynamic information flow analysis or DIFA, for two reasons. First, the problem of tracking dependences between statements, i.e., dynamic slicing, has already been addressed by numerous researchers. Second, DIFA is a more difficult problem given that the number of variables in a program is unbounded. We present an algorithm that, in the context of test suite execution, leverages the already computed dependences to efficiently compute subsequent dependences within the same or later test runs. To evaluate our proposed algorithm, we conducted an empirical comparative study that contrasted it, with respect to efficiency, to three other algorithms: (1) a naïve basic algorithm, (2) a memoization-based algorithm that does not leverage computed dependences from previous test runs, and (3) an algorithm that uses reduced ordered binary decision diagrams (roBDDs) to maintain and manage dependences. The results indicated that our new DIFA algorithm performed considerably better in terms of both runtime and memory consumption.
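To make the idea of dynamic information flow between variables concrete, here is a minimal sketch. It is not the algorithm from this paper: it assumes the execution trace is already available as a list of assignment events `(target, sources)`, and the function name `track_flows` and that trace format are hypothetical. It only shows how the dependence sets already computed for the source variables can be reused instead of re-walking the trace.

```python
def track_flows(trace):
    """Compute, for each variable, the set of variables that flowed into it.

    trace: iterable of (target, sources) pairs in execution order, where
    sources is the list of variables read by the assignment to target.
    """
    deps = {}
    for target, sources in trace:
        flowed = set()
        for src in sources:
            flowed.add(src)                 # direct dependence
            flowed |= deps.get(src, set())  # indirect: reuse what src already carries
        deps[target] = flowed               # an assignment overwrites earlier flows
    return deps


# Hypothetical trace for:  a = input(); b = a + 1; c = b * 2; b = 0
example = [("a", []), ("b", ["a"]), ("c", ["b"]), ("b", [])]
flows = track_flows(example)
```

In this example `c` ends up depending on both `b` and `a`, while the final reassignment of `b` to a constant clears its own dependence set without affecting `c`.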

20.
Hash tables, a type of data indexing structure that provides efficient data access based on key values, are widely used in computer applications, especially in system software, databases, and high-performance computing, where extremely high performance is required. In network, cloud computing, and IoT services, hash tables have become core components of cache systems. However, with the rapid growth in data volume, performance bottlenecks have gradually emerged in systems whose hash table structures are designed around multi-core CPUs. There is an urgent need to further improve the performance and scalability of hash tables. With the increasing popularity of general-purpose Graphics Processing Units (GPUs) and the substantial improvement of hardware computing capabilities and concurrency performance, various system software tasks built around parallel computing have been optimized on the GPU and have achieved considerable performance improvements. Because hash table accesses are sparse and random, using existing parallel hash table structures directly on GPUs inevitably causes high-frequency memory access and frequent data transfer over the bus, which degrades hash table performance on GPUs. This study analyzes the memory access, hit ratio, and index overhead of hash table indexes in cache systems. It proposes CCHT (Cache Cuckoo Hash Table), a hybrid access cache indexing framework adapted to the GPU. The framework provides cache strategies suited to different hit ratio and index overhead requirements and allows concurrent execution of write and query operations, which maximizes the use of the computing performance and concurrency characteristics of GPU hardware and reduces memory access and bus transfer overhead.
Through implementation on GPU hardware and experimental verification, CCHT is shown to outperform other cache indexing hash tables while ensuring cache hit ratios.
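For readers unfamiliar with the cuckoo hashing that CCHT's name refers to, the following is a minimal single-threaded CPU sketch of a generic cuckoo hash table. It is not the CCHT design: the class name, the two salted hash functions, the `max_kicks` limit, and the grow-and-rehash fallback are all assumptions for illustration (a cache would more likely evict an entry than grow).

```python
class CuckooHashTable:
    """Generic two-table cuckoo hash map (illustrative sketch only)."""

    def __init__(self, capacity=11, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which, key):
        # Two hash functions obtained by salting Python's built-in hash.
        return hash((which, key)) % self.capacity

    def get(self, key, default=None):
        # A lookup probes at most two slots, one per table.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # Update in place if the key is already stored.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        item, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._slot(which, item[0])
            # Place the item and pick up whatever occupied the slot.
            item, self.tables[which][i] = self.tables[which][i], item
            if item is None:
                return
            which = 1 - which  # the displaced entry tries its other table
        # Too many displacements: enlarge, rehash, then retry the pending item.
        self._grow()
        self.put(*item)

    def _grow(self):
        old = [e for table in self.tables for e in table if e is not None]
        self.capacity = 2 * self.capacity + 1
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k, v in old:
            self.put(k, v)


table = CuckooHashTable()
for n in range(100):
    table.put(n, n * n)
```

The bounded, two-probe lookup is the property that makes cuckoo hashing attractive for cache indexes, since every query touches a fixed small number of memory locations regardless of load.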
