Similar Documents
20 similar documents found (search time: 0 ms)
1.
The hardware complexity of hardware-only directory protocols in shared-memory multiprocessors has motivated many researchers to emulate directory management by software handlers executed on the compute processors, called software-only directory protocols. In this paper, we evaluate the performance and design trade-offs between these two approaches in the same architectural simulation framework driven by eight applications from the SPLASH-2 suite. Our evaluation reveals some common-case operations that can be supported by simple hardware mechanisms and can make the performance of software-only directory protocols competitive with that of hardware-only protocols. These mechanisms aim at either reducing the software handler latency or hiding it by overlapping it with the message latencies associated with inter-node memory transactions. Further, we evaluate the effects of cache block sizes between 16 and 256 bytes as well as two different page placement policies. Overall, we find that a software-only directory protocol enhanced with these mechanisms can reach between 63% and 97% of the baseline hardware-only protocol performance at a lower design complexity.

2.
As the computational performance of microprocessors continues to grow through the integration of an increasing number of processing cores on a single die, the interconnection network has become the central subsystem for providing the communications infrastructure among the on-chip cores as well as to off-chip memory. Silicon nanophotonics as an interconnect technology offers several promising benefits for future networks-on-chip, including low end-to-end transmission energy and high bandwidth density of waveguides using wavelength division multiplexing. In this work, we propose the use of time-division-multiplexed distributed arbitration in a photonic mesh network composed of silicon micro-ring resonator based photonic switches, which provides round-robin fairness to setting up photonic circuit paths. Our design sustains over 10× more bandwidth and uses less power than the compared network designs. We also observe a 2× improvement in performance for memory-centric application traces using the MORE modeling system.

3.
Sanjeev, Sencun, Sushil. Performance Evaluation, 2002, 49(1–4):21–41
In this paper, we present a new scalable and reliable key distribution protocol for group key management schemes that use logical key hierarchies (LKH) for scalable group rekeying. Our protocol, called WKA-BKR, is based upon two ideas: weighted key assignment and batched key retransmission. Both exploit the special properties of LKH and the group rekey transport payload to reduce the bandwidth overhead of the reliable key delivery protocol. Using both analytic modeling and simulation, we compare the performance of WKA-BKR with that of other rekey transport protocols, including a recently proposed protocol based on proactive FEC. Our results show that for most network loss scenarios, the bandwidth used by WKA-BKR is lower than that of the other protocols.
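The scalability that WKA-BKR builds on comes from the logarithmic rekeying cost of an LKH tree. As a minimal sketch (not taken from the paper, and assuming a balanced binary hierarchy), the number of keys that must be replaced when one member leaves is:

```python
import math

def lkh_rekey_cost(n_members: int) -> int:
    """Number of keys replaced when one member leaves a balanced
    binary logical key hierarchy (LKH): every key on the member's
    leaf-to-root path, i.e. ceil(log2(n)) subgroup/group keys."""
    return math.ceil(math.log2(n_members))

# Rekeying grows logarithmically in group size, which is what
# makes LKH-based group rekeying "scalable":
print(lkh_rekey_cost(1024))  # → 10
```

A flat scheme would instead re-distribute a new group key to all n − 1 remaining members, which is why hierarchical schemes dominate for large groups.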

4.
A methodology, called Subsystem Access Time (SAT) modeling, is proposed for the performance modeling and analysis of shared-bus multiprocessors. The methodology is subsystem-oriented because it is based on a Subsystem Access Time Per Instruction (SATPI) concept, in which we treat major components other than processors (e.g., off-chip cache, bus, memory, I/O) as subsystems and model for each of them the mean access time per instruction from each processor. The SAT modeling methodology is derived from the Customized Mean Value Analysis (CMVA) technique, which is request-oriented in the sense that it models the weighted total mean delay for each type of request processed in the subsystems. The subsystem-oriented view of the proposed methodology facilitates divide-and-conquer modeling and bottleneck analysis, which has rarely been addressed previously. These distinguishing features lead to a simple, general, and systematic approach to the analytical modeling and analysis of complex multiprocessor systems. To illustrate the key ideas and features that are different from CMVA, an example performance model of a particular shared-bus multiprocessor architecture is presented. The model is used to conduct performance evaluation for throughput prediction. Thereby, the SATPIs of the subsystems are directly utilized to identify the bottleneck subsystem and find the requests or subsystem components that cause the bottleneck. Furthermore, the SATPIs of the subsystems are employed to explore the impact of several performance-influencing factors, including memory latency, number of processors, data bus width, and DMA transfer.

5.
The effect of noise on various protocols of secure quantum communication has been studied. Specifically, we have investigated the effect of amplitude damping, phase damping, squeezed generalized amplitude damping, Pauli type as well as various collective noise models on the protocols of quantum key distribution, quantum key agreement, quantum secure direct quantum communication and quantum dialogue. From each type of protocol of secure quantum communication, we have chosen two protocols for our comparative study: one based on single-qubit states and the other one on entangled states. The comparative study reported here has revealed that single-qubit-based schemes are generally found to perform better in the presence of amplitude damping, phase damping, squeezed generalized amplitude damping noises, while entanglement-based protocols turn out to be preferable in the presence of collective noises. It is also observed that the effect of noise depends upon the number of rounds of quantum communication involved in a scheme of quantum communication. Further, it is observed that squeezing, a completely quantum mechanical resource present in the squeezed generalized amplitude channel, can be used in a beneficial way as it may yield higher fidelity compared to the corresponding zero squeezing case.

6.
This paper investigates the potential performance of hierarchical, cache-consistent multiprocessors. We have developed a mean-value queueing model that considers bus latency, shared-memory latency, and bus interference as the primary sources of performance degradation in the system. A key feature of the model is its choice of high-level input parameters that can be related to application program characteristics. Another important property of the model is that it is computationally efficient. Very large systems can be analyzed in a matter of seconds.

Results of the model show that system topology has an important effect on overall performance. We find optimal two-level, three-level and four-level topologies that distribute the bus traffic uniformly across all levels in the hierarchy. We provide processing power estimates for the optimal topologies, under a particular set of workload assumptions. For example, the optimal three-level topology supports 512 processors each with a peak processing rate of 4 MIPS, and provides an effective 1400 (1700) MIPS in processing power, if the buses operate at 20 (40) MHz. This result assumes 22% of the data references are to globally shared data and that the shared data is read on the average by a significant fraction of processors between write operations. Results of our study also indicate that for reasonably low cache miss rates (3% at level 0), and 20 MHz buses, the bus subnetwork saturates with processor speeds of 6–8 MIPS, at least for topologies of five or fewer levels. Finally, we present parametric results that indicate how performance is affected by one of the parameters that characterizes data sharing in the workload.
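As a back-of-envelope check of the figures quoted above (the raw numbers are from the abstract; reading the effective-to-peak ratio as an "efficiency" is our own interpretation):

```python
# All numbers below come from the abstract's three-level topology example;
# the efficiency reading of effective vs. peak MIPS is an interpretation.
peak = 512 * 4                 # 512 processors at 4 MIPS peak each
eff_20, eff_40 = 1400, 1700    # reported effective MIPS at 20 / 40 MHz buses

print(peak)                    # → 2048 total peak MIPS
print(eff_20 / peak)           # → 0.68359375 (about 68% of peak at 20 MHz)
print(eff_40 / peak)           # → 0.830078125 (about 83% of peak at 40 MHz)
```

The gap between peak and effective MIPS is exactly the bus-latency and bus-interference degradation the queueing model is built to predict.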


7.
The authors explore a testing approach where the concern for selecting the appropriate test input provided to the implementation under test (IUT) is separated as much as possible from the analysis of the observed output. Particular emphasis is placed on the analysis of the observed interactions of the IUT in order to determine whether the observed input/output trace conforms to the IUT's specification. The authors consider this aspect of testing with particular attention to testing of communication protocol implementations. Various distributed test architectures are used for this purpose, where partial input/output traces are observable by local observers at different interfaces. The error-detection power of different test configurations is determined on the basis of the partial trace visible to each local observer and their global knowledge about the applied test case. The automated construction of trace analysis modules from the formal specification of the protocol is also discussed. Different transformations of the protocol specification may be necessary to obtain the reference specification, which can be used by a local or global observer for checking the observed trace. Experience with the construction of an arbiter for the OSI (open systems interconnection) transport protocol is described.

8.
This paper presents new schedulability tests for preemptive global fixed-priority (FP) scheduling of sporadic tasks on identical multiprocessor platforms. One of the main challenges in deriving a schedulability test for global FP scheduling is identifying the worst-case runtime behavior, i.e., the critical instant, at which the release of a job suffers the maximum interference from the jobs of its higher priority tasks. Unfortunately, the critical instant is not yet known for sporadic tasks under global FP scheduling. To overcome this limitation, pessimism is introduced during the schedulability analysis to safely approximate the worst case. The endeavor in this paper is to reduce such pessimism by proposing three new schedulability tests for global FP scheduling. Another challenge for global FP scheduling is the problem of assigning the fixed priorities to the tasks, because no efficient method to find the optimal priority ordering in such a case is currently known. Each of the schedulability tests proposed in this paper can be used to determine the priority of each task based on Audsley's approach. It is shown that the proposed tests not only theoretically dominate but also empirically perform better than the state-of-the-art schedulability test for global FP scheduling of sporadic tasks.
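Audsley's approach, which the abstract relies on for priority assignment, is simple to sketch: assign the lowest remaining priority to any task that is schedulable there, then recurse on the rest. The following is an illustrative implementation assuming an abstract `schedulable(task, higher)` predicate; the paper's actual tests are not reproduced here:

```python
def audsley_assign(tasks, schedulable):
    """Audsley's optimal priority-assignment procedure: repeatedly find
    some task that passes the schedulability test at the lowest
    remaining priority level, assuming all still-unassigned tasks run
    at higher priority.  Returns tasks ordered lowest-priority first,
    or None if no feasible ordering exists for this test."""
    unassigned = list(tasks)
    order = []  # filled lowest priority first
    while unassigned:
        for i, t in enumerate(unassigned):
            higher = unassigned[:i] + unassigned[i + 1:]
            if schedulable(t, higher):
                order.append(unassigned.pop(i))
                break
        else:
            return None  # no task can take this priority level
    return order
```

The key property (shown by Audsley) is that any sustainable schedulability test plugged in as `schedulable` yields an optimal priority ordering with only O(n²) test invocations, instead of trying all n! orderings.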

9.
In this paper we present a cache coherence protocol for multistage interconnection network (MIN)-based multiprocessors with two distinct private caches: private-blocks caches (PCache) containing blocks private to a process, and shared-blocks caches (SCache) containing data accessible by all processes. The architecture is extended by a coherence control bus connecting all shared-block cache controllers. Timing problems due to variable transit delays through the MIN are dealt with by introducing Transient states in the proposed cache coherence protocol. The impact of the coherence protocol on system performance is evaluated through a performance study of three phases. Assuming homogeneity of all nodes, a single-node queuing model (phase 3) is developed to analyze system performance. This model is solved for processor and coherence bus utilizations using the mean value analysis (MVA) technique with shared-blocks steady-state probabilities (phase 1) and communication delays (phase 2) as input parameters. The performance of our system is compared to that of a system with an equivalent-sized unified cache and with a multiprocessor implementing a directory-based coherence protocol. System performance measures are verified through simulation.

10.
A survey of cache coherence schemes for multiprocessors
Stenstrom, P. Computer, 1990, 23(6):12–24
Schemes for cache coherence that exhibit various degrees of hardware complexity, ranging from protocols that maintain coherence in hardware, to software policies that prevent the existence of copies of shared, writable data, are surveyed. Some examples of the use of shared data are examined. These examples help point out a number of performance issues. Hardware protocols are considered. It is seen that consistency can be maintained efficiently, although in some cases with considerable hardware complexity, especially for multiprocessors with many processors. Software schemes are investigated as an alternative capable of reducing the hardware cost.

11.
The Dual Boundary Element Method (DBEM) has been presented as an effective numerical technique for the analysis of linear elastic crack problems [Portela A, Aliabadi MH. The dual boundary element method: effective implementation for crack problems. Int J Num Meth Engng 1992;33:1269–1287]. Analysis of large structural integrity problems may require large computational resources, both in terms of CPU time and memory. This paper reports a message-passing implementation of the DBEM formulation for the analysis of crack growth in structures. We have analyzed the construction of the system and its resolution. Different data distribution techniques have been studied on several problems. Results in terms of scalability and load balance for these two stages are presented in this paper.

12.
A method of providing redundancy to a class of bus-based multiprocessor arrays is discussed. The reconfiguration is hierarchical, providing global spare replacement at the array level and local reconfiguration within the spare block. Results of a yield analysis performed on a 32-processor array are presented.

13.
This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs initially were hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user-parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.

14.
In this paper we present a new parallel sorting algorithm suitable for implementation on a tightly coupled multiprocessor. The algorithm utilizes P processors to sort a set of N data items subdivided into M subsets. The performance of the algorithm is investigated, and the results of experiments carried out on the Balance 8000 multiprocessor are presented.
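The P-processors/M-subsets scheme the abstract outlines can be sketched as a sort-then-merge pipeline. In this hedged sketch, `ThreadPoolExecutor` stands in for the Balance 8000's processors, and the round-robin partitioning rule is our own assumption:

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

def parallel_sort(data, p=4, m=8):
    """Illustrative sort-then-merge: split N items into M subsets,
    sort the subsets with P workers in parallel, then k-way merge
    the sorted runs into the final result."""
    chunks = [data[i::m] for i in range(m)]        # M round-robin subsets
    with ThreadPoolExecutor(max_workers=p) as ex:  # P "processors"
        runs = list(ex.map(sorted, chunks))        # sort subsets concurrently
    return list(merge(*runs))                      # k-way merge of sorted runs

print(parallel_sort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0]))  # → [0, 1, ..., 9]
```

The subset sorts are embarrassingly parallel; the final merge is the sequential bottleneck, which is why the choice of M relative to P matters for the measured speedup.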

15.
16.
The task of writing basic system software for a multiprocessor system is simplified if the system is organized so that it may be programmed as a single virtual machine. This strategy is considered for tightly coupled real-time systems with many processors. Software is written in a high-level language supporting parallel execution (such as Concurrent Pascal, Modula, or Ada) without knowledge of the target hardware configuration. The Demos system is discussed as an example of such an approach.

17.
Critical Real-Time Embedded Systems require functional and timing validation to prove that they will perform their functionalities correctly and in time. For timing validation, a bound to the Worst-Case Execution Time (WCET) for each task is derived and passed as an input to the scheduling algorithm to ensure that tasks execute timely. Bounds to WCET can be derived with deterministic timing analysis (DTA) and probabilistic timing analysis (PTA), each of which relies upon certain predictability properties coming from the hardware/software platform beneath. In particular, specific hardware designs are needed for both DTA and PTA, which challenges their adoption by hardware vendors. This paper makes a step towards reconciling the hardware needs of DTA and PTA timing analyses to increase the likelihood of those hardware designs being adopted by hardware vendors. In particular, we show how Time Division Multiple Access (TDMA), which has been regarded as one of the main DTA-compliant arbitration policies, can be used in the context of PTA and, in particular, of the industrially-friendly Measurement-Based PTA (MBPTA). We show how the execution time measurements taken as input for MBPTA need to be padded to obtain reliable and tight WCET estimates on top of TDMA-arbitrated hardware resources with no further hardware support. Our results show that TDMA delivers tighter WCET estimates than MBPTA-friendly arbitration policies, whereas MBPTA-friendly policies provide higher average performance. Thus, the best policy to choose depends on the particular needs of the end user.
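The padding idea can be illustrated with a hedged sketch: assuming a TDMA round of `n_slots` slots of `slot_cycles` cycles each, a measurement is inflated by the worst-case wait a request could suffer before its slot comes around. The paper's exact padding rules are not reproduced here; this only shows the shape of the transformation applied before feeding the measurements to MBPTA:

```python
def pad_for_tdma(measured_cycles, n_slots, slot_cycles):
    """Illustrative padding of execution-time measurements for a
    TDMA-arbitrated resource: in the worst case a request arrives
    just after its slot closes and waits (n_slots - 1) full slots.
    Each measurement is inflated by that worst-case wait so the
    sample is valid regardless of slot alignment at deployment."""
    worst_wait = (n_slots - 1) * slot_cycles
    return [m + worst_wait for m in measured_cycles]

# Four cores sharing a bus in 10-cycle TDMA slots:
print(pad_for_tdma([100, 250], n_slots=4, slot_cycles=10))  # → [130, 280]
```

Padding trades some tightness in the measurements for the guarantee that slot-alignment effects observed at analysis time cannot be optimistic with respect to deployment.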

18.
MUPPET is a problem-solving environment for scientific computing with message-based multiprocessors. It consists of four parts: concurrent languages, programming environments, application environments, and man-machine interfaces. The programming paradigm of MUPPET is based on parallel abstract machines and transformations between them. This paradigm allows the development of programs which are portable among multiprocessors with different interconnection topologies.

In this paper we discuss the MUPPET programming paradigm. We give an introduction to the language CONCURRENT MODULA-2 and the graphic specification system GONZO. The graphic specification system tries to introduce graphics as a tool for programming. It is also the basis for program generation and transformation.


19.
Over the last decade, there has been a steady stream of pair programming studies. However, one significant area of pair programming that has not received its due attention is gender. Considering the fact that pair programming is one of the major human-centric software development paradigms, this is a gap that needs to be addressed. This empirical study conducted quantitative and qualitative analyses of different gender pair combinations within a pair programming context. Using a pool of university programming course students as the experiment participants, the study examined three gender pair types: female–female, female–male, and male–male. The results revealed that there was no significant gender difference in the pair programming coding output. But there were significant differences in the levels of pair compatibility and communication between the same-gender pair types, female–female and male–male, and the mixed-gender pair type, female–male. The post-experiment comments provide additional insights and details about gender in pair interactions.

20.
The scope of this paper is to review and evaluate all constant-round Group Key Agreement (GKA) protocols proposed so far in the literature. We have gathered all GKA protocols that require 1, 2, 3, 4, and 5 rounds and examined their efficiency. In particular, we calculated each protocol's computation and communication complexity and, using proper assessments, we compared their total energy cost. The evaluation of all protocols, interesting on its own, can also serve as a reference point for future works and contribute to the establishment of new, more efficient constant-round protocols.

