Similar articles
Found 20 similar articles (search time: 15 ms)
1.
This paper presents a new cache consistency scheme for hierarchically structured shared-memory multiprocessors. The scheme is simple, fast and efficient, and it does not require a large amount of state information to be maintained. The scheme exploits the broadcast capability of these systems, but limits the extent of the broadcasts by means of a novel filtering mechanism. As a specific example, it is shown how the proposed cache consistency scheme can be implemented on the Hector multiprocessor architecture. Using trace-driven simulations, we demonstrate that the scheme is scalable and performs well for common applications.

2.
In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared memory paradigm, communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed reductions of up to 13.9% in execution time, 30.5% in cache misses, and 39.4% in the number of invalidation messages.

3.
In many scientific applications, arrays containing data are indirectly indexed through indirection arrays. Such scientific applications are called irregular programs and are a distinct class of applications that require special techniques for parallelization. This paper presents a library called CHAOS, which helps users implement irregular programs on distributed-memory message-passing machines, such as the Paragon, Delta, CM-5 and SP-1. The CHAOS library provides efficient runtime primitives for distributing data and computation over processors; it supports efficient index translation mechanisms and provides users with high-level mechanisms for optimizing communication. CHAOS subsumes the previous PARTI library and supports a larger class of applications. In particular, it provides efficient support for parallelization of adaptive irregular programs where indirection arrays are modified during the course of computation. To demonstrate the efficacy of CHAOS, two challenging real-life adaptive applications were parallelized using CHAOS primitives: a molecular dynamics code, CHARMM, and a particle-in-cell code, DSMC. Besides providing runtime support to users, CHAOS can also be used by compilers to automatically parallelize irregular applications. This paper demonstrates how CHAOS can be effectively used in such a framework. By embedding CHAOS primitives in the Syracuse Fortran 90D/HPF compiler, kernels taken from the CHARMM and DSMC codes have been automatically parallelized.

4.
Network-on-chip (NoC) communication architectures present promising solutions for scalable communication requests in large system-on-chip (SoC) designs. Intellectual property (IP) core assignment and mapping are two key steps in NoC design, significantly affecting the quality of NoC systems. Both are NP-hard problems, so it is necessary to apply intelligent algorithms. In this paper, we propose improved intelligent algorithms for NoC assignment and mapping to overcome the drawbacks of traditional intelligent algorithms. The aim of our proposed algorithms is to optimize power consumption, time, area, and load balance. This work involves multiple conflicting objectives, so we combine multi-objective optimization with intelligent algorithms. In addition, we design a fault-tolerant routing algorithm and take account of reliability using comprehensive performance indices. The proposed algorithms were implemented on the Embedded System Synthesis Benchmarks Suite (E3S). Experimental results show that the improved algorithms achieve good performance in NoC designs, with high reliability.

5.
Future chip multiprocessors (CMPs) may have hundreds to thousands of threads competing to access shared resources, and will require quality-of-service (QoS) support to improve system utilization. This paper introduces Globally-Synchronized Frames (GSF), a framework for providing guaranteed QoS in on-chip networks in terms of minimum bandwidth and maximum delay bound. The GSF framework can be easily integrated in a conventional virtual channel (VC) router without significantly increasing the hardware complexity. We exploit a fast on-chip barrier network to efficiently implement GSF. Performance guarantees are verified by analysis and simulation. According to our simulations, all concurrent flows receive their guaranteed minimum share of bandwidth in compliance with a given bandwidth allocation. The average throughput degradation of GSF on an 8×8 mesh network is within 10% compared to the conventional best-effort VC router.

6.
This paper proposes a novel leakage management technique for applications with producer-consumer sharing patterns. Although previous research has proposed leakage management techniques by turning off inactive cache blocks, these techniques can be further improved by exploiting the various run-time characteristics of target applications in CMPs. By exploiting particular access sequences observed in producer-consumer sharing patterns and the spatial locality of shared buffers, our technique enables a more aggressive turn-off of L2 cache blocks of these buffers. Experimental results using a CMP simulator show that our proposed technique reduces the energy consumption of on-chip L2 caches, a shared bus, and off-chip memory by up to 31.3% over the existing cache leakage power management techniques with no significant performance loss.

7.
In this paper, a comprehensive study is first conducted to investigate the effects of cache coherence protocols and cache replacement policies on the characteristics of NUCA in current many-core processors. The main focus of this study is to analyze the effects of coherence protocols and replacement policies on the vulnerability of caches. The outcomes of this analysis indicate two facts: (i) differences in how write operations are handled play an important role in deciding for or against a cache coherence protocol; (ii) near-optimal solutions to the replacement problem, aimed at enhancing performance, can also help reduce the cache vulnerability factor. Based on the results of the first step, two schemes are introduced to enhance the reliability of caches by modifying the structures of cache coherence protocols and cache replacement policies. The first scheme manages the sharing of dirty data items among different same-level caches. The second gives old dirty blocks higher priority than clean blocks for replacement. The proposed schemes yield about an 18% improvement in MTTF, with negligible performance, bandwidth and energy consumption overhead compared to previous cache structures.

8.
Resource reclaiming schemes are typically applied in reservation-based real-time uniprocessor systems to support efficient reclaiming and sharing of computational resources left unused by early-completing tasks, improving the response times of aperiodic and soft tasks in the presence of overruns. In this paper, we introduce a novel and efficient reclaiming algorithm, named M-CASH, for multiprocessor platforms. M-CASH leverages the resource reservation approach offered by the Multiprocessor CBS server, yielding significant improvements. The correctness of the algorithm is formally proven and its performance is evaluated through extensive synthetic simulations.
Marco Caccamo

9.
Networks-on-chip (NoCs) are considered the next generation of communication infrastructure in embedded systems. In the platform-based design methodology, an application is implemented by a set of collaborative intellectual property (IP) blocks. The selection of the most suitable set of IPs and their physical mapping onto the NoC infrastructure to efficiently implement the application at hand are two hard combinatorial problems that occur during the synthesis process of NoC-based embedded system implementation. In this paper, we propose an innovative preference-based multi-objective evolutionary methodology to perform the assignment and mapping stages. We use the well-known and efficient multi-objective evolutionary algorithms NSGA-II and microGA as kernels. The optimization processes of assignment and mapping are both driven by the minimization of the required silicon area and the imposed execution time of the application, considering that the decision maker’s preference is a pre-specified value of the overall power consumption of the implementation.

10.
We present design details and some initial performance results of a novel scalable shared memory multiprocessor architecture. This architecture features the automatic data migration and replication capabilities of cache-only memory architecture (COMA) machines, without the accompanying hardware complexity. A software layer manages cache space allocation at page granularity, similarly to distributed virtual shared memory (DVSM) systems, leaving simpler hardware to maintain shared memory coherence at cache-line granularity.

By reducing the hardware complexity, the machine cost and development time are reduced. We call the resulting hybrid hardware and software multiprocessor architecture Simple COMA. Preliminary results indicate that the performance of Simple COMA is comparable to that of more complex contemporary all-hardware designs.


11.
Hybrid computing systems (incorporating FPGAs, GPUs, etc.) have received considerable attention recently as an approach to significant performance gains in many problem domains. Deploying applications on these systems, however, has proven to be difficult and very labor-intensive. In this paper we review the current state of practice for application development on hybrid systems. We also present our vision of the application development languages and tools that we believe would greatly benefit the process of designing, implementing, and deploying applications on hybrid systems.

12.
Designing efficient parallel algorithms for a message-based parallel computer requires considering both time-space tradeoffs and computation-communication tradeoffs. In order to balance these tradeoffs and achieve the optimal performance of an algorithm, one has to consider various design parameters such as the number of processors required and the size of partitions. In this paper, we demonstrate that, for certain data parallel algorithms, it is possible to determine these design parameters analytically. To serve as a basis for the discussions that follow, a simple model for the NCUBE hypercube computer is introduced. Using this model, we use two examples, array summation and matrix multiplication, to illustrate how their performance can be modeled. By optimizing these expressions, one is able to determine optimal design parameters that yield efficient execution. Experiments on a 64-node NCUBE verified the accuracy of the analytic results and are used to further support the discussions. This research was supported in part by the DARPA ACMP Project and in part by the NSF grant CCR-87-16833.

13.
In glueless shared-memory multiprocessors where cache coherence is usually maintained using a directory-based protocol, the fast access to the on-chip components (caches and network router, among others) contrasts with the much slower main memory. Unfortunately, directory-based protocols need to obtain the sharing status of every memory block before coherence actions can be performed. This information has traditionally been stored in main memory, and therefore these cache coherence protocols are far from being optimal. In this work, we propose two alternative designs for the last-level private cache of glueless shared-memory multiprocessors: the lightweight directory and the SGluM cache. Our proposals completely remove directory information from main memory and store it in the home node’s L2 cache, thus reducing both the number of accesses to main memory and the directory memory overhead. The main characteristics of the lightweight directory are its simplicity and the significant improvement in the execution time for most applications. Its drawback, however, is that the performance of some particular applications could be degraded. On the other hand, the SGluM cache offers more modest improvements in execution time for all the applications by adding some extra structures that cope with the cases in which the lightweight directory fails.

14.
The uniform memory hierarchy model of computation (cited 9 times in total: 0 self-citations, 9 citations by others)
The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performance-relevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved data-movement strategies. A sequential computer's memory is modeled as a sequence $M_0, M_1, \ldots$ of increasingly large memory modules. Computation takes place in $M_0$. Thus, $M_0$ might model a computer's central processor, while $M_1$ might be cache memory, $M_2$ main memory, and so on. For each module $M_u$, a bus $B_u$ connects it with the next larger module $M_{u+1}$. All buses may be active simultaneously. Data is transferred along a bus in fixed-sized blocks. The size of these blocks, the time required to transfer a block, and the number of blocks that fit in a module are larger for modules farther from the processor. The UMH model is parametrized by the rate at which the blocksizes increase and by the ratio of the blockcount to the blocksize. A third parameter, the transfer-cost (inverse bandwidth) function, determines the time to transfer blocks at the different levels of the hierarchy. UMH analysis refines traditional methods of algorithm analysis by including the cost of data movement throughout the memory hierarchy. The communication efficiency of a program is a ratio measuring the portion of UMH running time during which $M_0$ is active. An algorithm that can be implemented by a program whose communication efficiency is nonzero in the limit is said to be communication-efficient. The communication efficiency of a program depends on the parameters of the UMH model, most importantly on the transfer-cost function. A threshold function separates those transfer-cost functions for which an algorithm is communication-efficient from those that are too costly. Threshold functions for matrix transpose, standard matrix multiplication, and Fast Fourier Transform algorithms are established by exhibiting communication-efficient programs at the threshold and showing that more expensive transfer-cost functions are too costly. A parallel computer can be modeled as a tree of memory modules with computation occurring at the leaves. Threshold functions are established for multiplication of $N \times N$ matrices using up to $N^2$ processors in a tree with constant branching factor.
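
As a rough illustration of the parameters named in the abstract above, the following Python sketch tabulates a few levels of a UMH-style hierarchy and computes the communication-efficiency ratio. It is only a sketch under stated assumptions: the names `alpha` (blockcount-to-blocksize ratio), `rho` (blocksize growth rate), `f` (transfer-cost function), the geometric growth of block sizes starting from size 1 at level 0, and the per-block transfer time `block_size * f(u)` are illustrative choices, not details confirmed by the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Level:
    index: int                   # u, the level number (0 is nearest the processor)
    block_size: int              # assumed: rho ** u
    block_count: int             # assumed: alpha * block_size
    capacity: int                # block_size * block_count
    block_transfer_time: float   # assumed: block_size * f(u) on bus B_u


def umh_levels(alpha: int, rho: int, f: Callable[[int], float], depth: int) -> List[Level]:
    """Tabulate the first `depth` modules of a hypothetical UMH hierarchy."""
    levels = []
    for u in range(depth):
        size = rho ** u        # block size grows geometrically (assumption)
        count = alpha * size   # blockcount / blocksize ratio is alpha
        levels.append(Level(u, size, count, size * count, size * f(u)))
    return levels


def communication_efficiency(m0_active_time: float, total_time: float) -> float:
    """Portion of the running time during which module M_0 is active."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    if not 0 <= m0_active_time <= total_time:
        raise ValueError("m0_active_time must lie in [0, total_time]")
    return m0_active_time / total_time


if __name__ == "__main__":
    # Hypothetical parameters: alpha=2, rho=4, constant transfer cost.
    for level in umh_levels(alpha=2, rho=4, f=lambda u: 1.0, depth=3):
        print(level)
    print(communication_efficiency(m0_active_time=75.0, total_time=100.0))
```

With a constant `f`, the per-block transfer time in this sketch grows only with the block size; passing a faster-growing `f` makes data movement at the outer levels more expensive, which is the effect the threshold functions in the abstract are meant to characterize.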

15.
16.
We address the problem of designing a dynamically reconfigurable hardware-software architecture that allows networking functions to be partitioned on a SoC (System on Chip) platform. We treat this as the problem of partitioning network protocol functions between dynamically reconfigurable hardware and software modules. Such a partitioning technique can improve the co-design productivity of hardware and software modules. In practice, the proposed partitioning technique, called the ITC (Inter-Task Communication) technique and incorporating the RT-IJC2 (Real-Time Inter-Job Communication Channel), makes it possible to partition networking functions into hardware and software modules on the SoC platform. Additionally, the proposed partitioning technique supports the modularity and reuse of complex network protocol functions, enabling a higher level of abstraction when mapping future network protocol specifications onto the SoC platform. In particular, the RT-IJC2 allows more complex data transfers between hardware and software tasks while also providing real-time data processing for given application-specific real-time requirements. We conduct a variety of experiments to illustrate the application and efficiency of the proposed technique after implementing it on a commercial SoC platform based on Altera’s Excalibur, which includes the ARM922T core and up to 1 million gates of programmable logic.

17.
Video streaming is vital for many important applications such as distance learning, digital video libraries, and movie-on-demand. Since video streaming requires significant server and networking resources, caching has been used to reduce the demand on these resources. In this paper, we propose a novel collaboration scheme for video caching on overlay networks, called Overlay Caching Scheme (OCS), to further minimize service delays and loads placed on an overlay network for video streaming applications. OCS is neither a centralized nor a hierarchical collaborative scheme. Despite its design simplicity, OCS effectively uses the aggregate storage space and capability of distributed overlay nodes to cache popular videos and serve nearby clients. Moreover, OCS is lightweight and adaptive to clients’ locations and request patterns. We also investigate other video caching techniques for overlay networks, including both collaborative and non-collaborative ones. Compared with these techniques on topologies inspired by actual networks, OCS offers extremely low average service delays and approximately half the server load. OCS also offers a smaller network load in most cases in our study.
Wanida Putthividhya

18.
We investigate how transactional memory can be adapted for embedded systems. We consider energy consumption and complexity to be driving concerns in the design of these systems and therefore adapt simple hardware transactional memory (HTM) schemes in our architectural design. We propose several different cache structures and contention management schemes to support HTM and evaluate them in terms of energy, performance, and complexity. We find that ignoring energy considerations can lead to poor design choices, particularly for resource-constrained embedded platforms. We conclude that with the right balance of energy efficiency and simplicity, HTM will become an attractive choice for future embedded system designs.

19.
3D chip multi-processors (3D CMPs) combine the advantages of 3D integration and the parallelism of CMPs, and are emerging as an active research topic in the VLSI and multi-core computer architecture communities. One significant potential of 3D CMPs is to exploit the diversity of integration processes and the high volume of vertical TSV bandwidth to mitigate the well-known “Memory Wall” problem. Meanwhile, 3D integration techniques are subject to severe thermal, manufacturing-yield and cost constraints. Research on 3D stacked memory hierarchies explores high-performance and power/thermal-efficient memory architectures for 3D CMPs. The micro-architectures of memories can be designed in the 3D integrated circuit context and integrated into 3D CMPs. This paper surveys the design of memory architectures for 3D CMPs. We summarize current research into two categories: stacking cache-only architectures and stacking main memory architectures for 3D CMPs. The representative works are reviewed and the remaining opportunities and challenges are discussed to guide future research in this emerging area.

20.
Transactional Memory (TM) is a programmer-friendly alternative to traditional lock-based concurrency. Although it intends to simplify concurrent programming, the performance of applications still depends on how frequently they synchronize and the way they access shared data. These aspects must be taken into consideration if one intends to exploit the full potential of modern multicore platforms. Since these platforms feature complex memory hierarchies composed of different levels of cache, applications may suffer from memory latencies and bandwidth problems if threads are not properly placed on cores. An interesting approach to efficiently exploit the memory hierarchy is called thread mapping. However, a single fixed thread mapping cannot deliver the best performance when dealing with a large range of transactional workloads, TM systems and platforms. In this article, we propose and implement in a TM system a set of adaptive thread mapping strategies for TM applications to tackle this problem. They range from simple strategies that do not require any prior knowledge to strategies based on Machine Learning techniques. Taking the Linux default strategy as baseline, we achieved performance improvements of up to 64.4% on a set of synthetic applications and an overall performance improvement of up to 16.5% on the standard STAMP benchmark suite.
