Similar literature
 20 similar articles found (search time: 15 ms)
1.
On large-scale multiprocessors, access to common memory is one of the key performance-limiting factors. Shared-memory performance depends not only on the characteristics of the memory hierarchy itself, but also on the characteristics of the memory address streams and the interaction between the two. We present a technique for multiprocessor workload construction and a family of artificial kernels, called MAD-kernels, to systematically investigate the behavior of the memory hierarchy. The measured performance is independent of any particular application or algorithm. The proposed methodology is demonstrated on two commercial shared-memory systems.

2.
Disk arrays and shared-memory multiprocessors are new technologies that are rapidly becoming pervasive. They are complementary because disk arrays naturally balance the I/O workload by interleaving data across all disks while a shared-memory multiprocessor balances the processing workload across multiple processors. In this paper, we examine how disk arrays and shared-memory multiprocessors lead to an effective method for constructing database machines for general-purpose complex query processing. We show that disk arrays can lead to cost-effective storage systems if they are configured from suitably small form-factor disk drives. We introduce the storage system metric data temperature (IO/s/Gbyte) as a way to evaluate how well a disk configuration can sustain its workload, and we show that disk arrays can sustain the same data temperature as a more expensive mirrored-disk configuration. We use the metric to evaluate the performance of disk arrays in XPRS, an operational shared-memory multiprocessor database system being developed at the University of California, Berkeley.
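
The data temperature metric in the abstract above is a ratio of I/O rate to stored capacity, so it can be illustrated with a short Python sketch. The function names and all disk figures below are hypothetical, chosen only to show the arithmetic; none of them come from the paper.

```python
def data_temperature(io_per_second: float, capacity_gbytes: float) -> float:
    """Return data temperature in IO/s per Gbyte."""
    if capacity_gbytes <= 0:
        raise ValueError("capacity_gbytes must be positive")
    return io_per_second / capacity_gbytes


def can_sustain(config_temperature: float, workload_temperature: float) -> bool:
    """A configuration sustains a workload if it is at least as 'hot'."""
    return config_temperature >= workload_temperature


# Hypothetical workload: 400 IO/s spread over 20 Gbytes of data.
workload = data_temperature(io_per_second=400, capacity_gbytes=20)

# Hypothetical array: 8 small disks, each assumed to give 50 IO/s and 2 Gbytes.
disk_array = data_temperature(io_per_second=8 * 50, capacity_gbytes=8 * 2)

print(workload, disk_array, can_sustain(disk_array, workload))
```

Under these assumed figures the array delivers more IO/s per Gbyte than the workload demands, which is the kind of comparison the metric is meant to support.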

3.
A transparent distributed shared memory (DSM) system must achieve complete transparency in data distribution, workload distribution, and reconfiguration. Transparent data distribution lets programmers access and allocate shared data through the same user interface as in shared-memory systems. Transparent workload distribution and reconfiguration can optimize parallelism at both the user level and the kernel level, and also improve the efficiency of run-time reconfiguration. In this paper, a transparent DSM system called Teamster is proposed and implemented for clustered symmetric multiprocessors. With the transparency provided by Teamster, programmers can exploit all the computing power of the clustered SMP nodes as transparently as they would on a single SMP computer. Compared with the results of previous research, Teamster realizes the transparency of cluster computing and obtains satisfactory system performance.

4.
Latency measures the delay caused by communication between processors and memory modules over the network in a parallel system. Using intensive measurements and simulation, we show that network latency forms a major obstacle to improving parallel computing performance and scalability. We present an experimental metric, using network latency to measure and evaluate the scalability of parallel programs and architectures. This latency metric is an extension to the isoefficiency function [Grama et al., IEEE Parallel Distrib. Technology 1, 3 (1993), 12-21] and iso-speed metric [Sun and Rover, IEEE Trans. Parallel Distrib. Systems 5, 6 (1994), 599-613]. We give a measurement method for using this latency metric, and report the experimental results of evaluating the scalabilities of several scientific computing algorithms on the KSR-1 shared-memory architecture. Our analysis and experiments show that the latency metric is a practical method to effectively predict and evaluate scalability based on measured latencies inherent in the program and the architecture.

5.
The $H_\infty$ norm is widely used in the analysis and synthesis of robust control, a field which continues to flourish and develop. However, the $H_\infty$ norm can only be used to measure the distance between two stable systems, not unstable systems. It is therefore sometimes not an appropriate measure of the gap between two systems. In this paper, a new metric, the angular metric, defined in linear spaces of real rational matrices, is used to measure the distance between two systems with different dimensions. It is also used to measure the uncertainties and describe the performance specifications of the robust control system. In the framework of this metric, the robust stability margin is proposed to characterize the stability robustness of the closed-loop system. When both the plant and the controller have uncertainties simultaneously, we introduce structural robust stability and prove the necessary and sufficient conditions for the robust stability of the feedback control system.

6.
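Cui Pengjie, Yuan Ye, Li Cenhao, Zhang Can, Wang Guoren. Journal of Software (软件学报), 2022, 33(3): 1018-1042
Graphs are an important data structure for describing relationships between entities, and are widely used in major scientific fields such as information science, physics, biology, and environmental ecology. Today, as the scale of graph data keeps growing, processing large graphs on distributed systems has become mainstream, and classic distributed large-graph processing systems such as Pregel, GraphX, PowerGraph, and Gemini have emerged. However, compared with current state-of-the-art single-machine graph processing systems...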

7.
Traditional implementations of sequential consistency in shared-memory systems require memory accesses to be globally performed in program order. Based on an event ordering model for correct executions in shared-memory systems, this paper proposes and proves that out-of-order execution does not affect the correctness of an execution provided that a certain condition is met. Simulation results show that the out-of-order execution proposed in this paper is an effective way to improve the performance of a sequentially consistent shared-memory system.

8.
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims elements once they are no longer in use. The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes (quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation) using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect the memory reclamation performance. We discuss the consequences of our results for programmers and algorithm designers. Finally, we describe the use of one scheme, quiescent-state-based reclamation, in the context of an OS kernel, an execution environment which is well suited to this scheme.

9.
Mining association rules from large databases is very costly. We propose to develop parallel algorithms for this task on shared-memory multiprocessors (SMPs). All parallel algorithms proposed for other paradigms follow the conventional level-wise approach: they need as many iterations as the length of the maximum large itemset. To make matters worse, they impose a synchronization in every iteration, which would cause serious I/O contention on a shared-memory parallel system. An adaptive asynchronous parallel mining algorithm, APM, has been proposed for SMPs. All processors generate candidates dynamically and count itemset supports independently without synchronization. Two optimization techniques have been proposed to reduce database scanning and the number of candidates. APM has been implemented on a Sun Enterprise 4000 shared-memory multiprocessor with 12 nodes. The experiments show that the optimizations are very effective and that APM has a substantial performance lead over other proposed level-wise algorithms.

10.
Parallel computing scalability evaluates the extent to which parallel programs and architectures can effectively utilize increasing numbers of processors. In this paper, we compare a group of existing scalability metrics and evaluation models with an experimental metric which uses network latency to measure and evaluate the scalability of parallel programs and architectures. To provide insight into dynamic system performance, we have developed an integrated software environment prototype for measuring and evaluating multiprocessor scalability performance, called Scale-Graph. Scale-Graph uses a graphical instrumentation monitor to collect, measure and analyze latency-related data, and to display scalability performance based on various program execution patterns. The graphical software tool is X-Windows based and is currently implemented on standard workstations to analyze performance data of the KSR-1, a hierarchical ring-based shared-memory architecture.

11.
Asynchronous group mutual exclusion   Cited by: 1 (self-citations: 1, citations by others: 0)
Abstract. Mutual exclusion and concurrency are two fundamental and essentially opposite features in distributed systems. However, in some applications such as Computer Supported Cooperative Work (CSCW) we have found it necessary to impose mutual exclusion on different groups of processes in accessing a resource, while allowing processes of the same group to share the resource. To our knowledge, no such design issue has been previously raised in the literature. In this paper we address this issue by presenting a new problem, called Congenial Talking Philosophers, to model group mutual exclusion. We also propose several criteria to evaluate solutions of the problem and to measure their performance. Finally, we provide an efficient and highly concurrent distributed algorithm for the problem in a shared-memory model where processes communicate by reading from and writing to shared variables. The distributed algorithm meets the proposed criteria, and has performance similar to some naive but centralized solutions to the problem. Received: November 1998 / Accepted: April 2000
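
The group mutual exclusion property described above (processes of the same group may share the resource, different groups exclude each other) can be illustrated with a minimal Python sketch. This is a naive centralized, lock-based version and is not the paper's distributed algorithm; the class name `GroupLock` and its methods are hypothetical, and the sketch makes no fairness guarantee, so a waiting group can starve.

```python
import threading


class GroupLock:
    """Naive centralized group mutual exclusion (illustration only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._group = None   # group currently holding the resource
        self._count = 0      # number of holders from that group

    def try_enter(self, group) -> bool:
        """Enter without blocking; return False if another group holds it."""
        with self._cond:
            if self._count and self._group != group:
                return False
            self._group = group
            self._count += 1
            return True

    def enter(self, group) -> None:
        """Block until the resource is free or held by the same group."""
        with self._cond:
            while self._count and self._group != group:
                self._cond.wait()
            self._group = group
            self._count += 1

    def leave(self) -> None:
        """Release one hold; wake waiters when the last holder leaves."""
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._group = None
                self._cond.notify_all()
```

Two callers passing the same group value to `enter` proceed concurrently, while a caller from a different group waits until every current holder has called `leave`.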

12.
Loops are the single largest source of parallelism in many applications. One way to exploit this parallelism is to execute loop iterations in parallel on different processors. Previous approaches to loop scheduling attempted to achieve the minimum completion time by distributing the workload as evenly as possible while minimizing the number of synchronization operations required. The authors consider a third dimension to the problem of loop scheduling on shared-memory multiprocessors: communication overhead caused by accesses to nonlocal data. They show that traditional algorithms for loop scheduling, which ignore the location of data when assigning iterations to processors, incur a significant performance penalty on modern shared-memory multiprocessors. They propose a new loop scheduling algorithm that attempts to simultaneously balance the workload, minimize synchronization, and co-locate loop iterations with the necessary data. They compare the performance of this new algorithm to other known algorithms by using five representative kernel programs on a Silicon Graphics multiprocessor workstation, a BBN Butterfly, a Sequent Symmetry, and a KSR-1, and show that the new algorithm offers substantial performance improvements, up to a factor of 4 in some cases. The authors conclude that loop scheduling algorithms for shared-memory multiprocessors cannot afford to ignore the location of data, particularly in light of the increasing disparity between processor and memory speeds.
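
The abstract above does not spell out the scheduling algorithm, so the following Python sketch only illustrates the three stated goals under assumptions of its own: each processor starts with one contiguous block of iterations (locality), takes iterations from its own queue in shrinking chunks (few synchronizations), and takes work from the most loaded queue only when its own is empty (load balance). The function names and the 1/P chunk-size rule are hypothetical choices, not taken from the paper.

```python
from math import ceil


def initial_partition(n_iters, n_procs):
    """Give each processor one contiguous block of iteration indices."""
    block = ceil(n_iters / n_procs)
    return [list(range(p * block, min((p + 1) * block, n_iters)))
            for p in range(n_procs)]


def next_chunk(queues, me):
    """Return the iterations processor `me` should run next.

    Prefer the local queue; if it is empty, take from the most loaded
    queue. The chunk is 1/P of the chosen queue, rounded up (assumed rule).
    """
    n_procs = len(queues)
    if queues[me]:
        victim = me
    else:
        victim = max(range(n_procs), key=lambda p: len(queues[p]))
        if not queues[victim]:
            return []  # no work left anywhere
    take = ceil(len(queues[victim]) / n_procs)
    chunk, queues[victim] = queues[victim][:take], queues[victim][take:]
    return chunk


queues = initial_partition(10, 3)
print(queues)
print(next_chunk(queues, 2))
```

In a real shared-memory implementation each queue would need its own lock or atomic counter; the sketch is sequential and leaves that out.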

13.
The behavior of n interacting processes synchronized by the "Time Warp" rollback mechanism is analyzed under the constraint that the total amount of memory to execute the program is limited. In Time Warp, a protocol called "cancelback" has been proposed to reclaim storage when the system runs out of memory. A discrete state, continuous time Markov chain model for Time Warp augmented with the cancelback protocol is developed for a shared memory system with n homogeneous processors and homogeneous workload with constant message population. The model allows one to predict speedup as the amount of available memory is varied. The performance predicted by the model is validated through performance measurements on an operational Time Warp system executing on a shared-memory multiprocessor using a workload similar to that in the model. It is observed that if the sequential simulation requires m message buffers, Time Warp with a small fraction of message buffers beyond m performs almost as well as Time Warp with unlimited memory.

14.
A human–machine interface framework provides general guidelines for what information should be put on an interface display screen. The framework is thus a first step towards the design of an effective and efficient interface. This paper reports on an experimental study of two proposed frameworks: the ecological interface design framework and the function–behaviour–state framework. In order to provide an unbiased comparative evaluation of both interfaces, the same application problem is used. The interfaces, based on each of the two frameworks, are implemented with look-and-feel forms that are as similar as possible in the presentation of information contents. Only the normal control operation and fault detection situations are considered at this stage of the study. In addition, three categories of measures are used in this study, namely: the performance measure; the physiological measure (the eye movement measure: the eye fixation and the pupil diameter change, in particular); and the subjective (or user-rated) measure. The major results obtained from the study include the following: (1) the information called the abstract function in the ecological interface design framework may not positively correlate with performance improvement yet may increase the mental workload, (2) the function–behaviour–state framework seems more agreeable with the operator's mental model, and (3) operators may perform equally well with a function–behaviour–state interface but with a reduced mental workload. It is also found that the eye fixation measure is highly consistent with the performance measure and the subjective measure. The pupil diameter measure is found not to be significantly sensitive to the mental workload information; however, it appears sensitive to the mental workload information among individual participants and shows a consistent result with the other measures used.

15.
A faster divide-and-conquer algorithm for constructing Delaunay triangulations   Cited by: 15 (self-citations: 0, citations by others: 15)
Rex A. Dwyer. Algorithmica, 1987, 2(1): 137-151
An easily implemented modification to the divide-and-conquer algorithm for computing the Delaunay triangulation of $n$ sites in the plane is presented. The change reduces its $\Theta(n \log n)$ expected running time to $O(n \log \log n)$ for a large class of distributions that includes the uniform distribution in the unit square. Experimental evidence presented demonstrates that the modified algorithm performs very well for $n \le 2^{16}$, the range of the experiments. It is conjectured that the average number of edges it creates, a good measure of its efficiency, is no more than twice optimal for $n$ less than seven trillion. The improvement is shown to extend to the computation of the Delaunay triangulation in the $L_p$ metric for $1 < p \le \infty$. This research was supported by National Science Foundation Grants DCR-8352081 and DCR-8416190.

16.
Cluster Queue Structure for Shared-Memory Multiprocessor Systems   Cited by: 1 (self-citations: 0, citations by others: 1)
Three basic structures have been proposed to organize the task queues for shared-memory multiprocessor systems: centralized, distributed, and hierarchical. Centralized structures are not suitable for massively parallel systems since the shared queue becomes a bottleneck for frequent enqueuing and dequeuing operations. Distributed structures suffer from load imbalance because they provide no support for workload sharing between queues. Hierarchical structures are intended to combine the advantages of the previous two structures and eliminate their disadvantages. Unfortunately, we find that load imbalance still exists in the hierarchical structure and has a significant impact on system performance, particularly when the workload is heavy and irregular. After identifying the cause of this problem, we propose the use of a clustered structure in place of the hierarchical one. Analyses and simulations show that the proposed structure can provide better load balancing and less contention than the hierarchical one.

17.
The effectiveness of coprime factor techniques in $L_2$ and $L_\infty$ model reduction of unstable linear systems is analysed. Asymptotic estimates are given of the achievable error in the stable and unstable parts of the approximate system, measured in a number of different norms, some involving the associated Hankel operators. The chordal metric is introduced as a measure of approximation and is shown to yield the graph topology. Finally it is deduced that asymptotically optimal $L_2$ and $L_\infty$ convergence rates can be obtained for a large class of unstable systems.

18.
Over the past decades, computer engineers have focused on building high-performance, large-scale computer systems at low cost. One example is a distributed-memory computer system such as a cluster, where fast processing nodes built from commodity processors are connected through a high-speed network. But it is not easy to develop applications on such a system, because a programmer must consider all data and control dependences between processes and program them explicitly. To alleviate this problem, the distributed virtual shared-memory (DVSM) system has been proposed. It is well known that the performance of a DVSM system depends heavily on the network's performance and programming semantics, and currently its performance is very limited on a conventional network. Recently many advanced hardware-based interconnection technologies have been introduced, and one of them is the InfiniBand Architecture (IBA), which supports shared-memory programming semantics by means of remote direct memory access (RDMA) and atomic operations. In this paper, we present the implementation of our InfiniBand-based DVSM system and analyze the performance of the SPEC OMP benchmarks in detail by comparing it with a DVSM based on the traditional network architecture and with a hardware shared-memory multiprocessor (SMP) system. Our experimental results show that our DVSM system, which uses the full features of the IBA, improves performance significantly over the traditional IPoIB-based system on the IBA, and furthermore that the performance of one application on the IBA-based DVSM system is better than on the hardware SMP.

19.
This paper presents a unified architecture for public key cryptosystems that can support the operations of the Rivest–Shamir–Adleman cryptosystem (RSA) and the elliptic curve cryptosystem (ECC). A hardware solution is proposed for operations over the finite fields GF($p$) and GF($2^p$). The proposed architecture presents a unified arithmetic unit which provides the functions of dual-field modular multiplication, dual-field modular addition/subtraction, and dual-field modular inversion. A new adder based on the signed-digit (SD) number representation is provided for carry-propagated and carry-less operations. The critical path of the proposed design is reduced compared with previous full adder implementation methods. Experimental results show that the proposed design can achieve a clock speed of 1 GHz using 776 K gates in a 0.09 μm CMOS standard cell technology, or 150 MHz using 5227 CLBs in a Xilinx Virtex 4 FPGA. While the different technologies, platforms and standards make a definitive comparison difficult, based on the performance of our proposed design, we achieve a performance improvement of between 30% and 250% when compared with existing designs.

20.
A practical methodology for evaluating and comparing the performance of distributed-memory Multiple Instruction Multiple Data (MIMD) systems is presented. The methodology determines machine parameters and program parameters separately, and predicts the performance of a given workload on the machines under consideration. Machine parameters are measured using benchmarks that consist of parallel algorithm structures. The methodology takes a workload-based approach in which a mix of application programs constitutes the workload. The performance of different systems is compared, under the given workload, using the ratio of their speeds. In order to validate the methodology, an example workload has been constructed and the time estimates have been compared with the actual runs, yielding good predicted values. Variations in the workload are analysed in terms of increases in problem size and changes in the frequency of particular algorithm groups. Utilization and scalability are used to compare the systems when the number of processors is increased. It has been shown that the performance of parallel computers is sensitive to changes in the workload and therefore any evaluation and comparison must consider a given user workload. The performance improvement that can be obtained by increasing the size of a distributed-memory MIMD system depends on the characteristics of the workload as well as the parameters that characterize the communication speed of the parallel system.
