Similar Documents
20 similar documents found.
1.
Partitioning of processors on a multiprocessor system involves logically dividing the system into processor partitions. Programs can be executed in the different partitions in parallel. Optimally setting the partition size can significantly improve the throughput of multiprocessor systems. The speedup characteristics of parallel programs can be defined by execution signatures. The execution signature of a parallel program on a multiprocessor system is the rate at which the program executes in the absence of other programs; it depends upon the number of allocated processors, the specific architecture, and the specific program implementation. Based on the execution signatures, this paper analyzes simple Markovian models of dynamic partitioning. From the analysis, when there are at most two multiprocessor partitions, the optimal dynamic partition size can be found which maximizes throughput. Compared against other partitioning schemes, the dynamic partitioning scheme is shown to be the best in terms of throughput when the reconfiguration overhead is low. If the reconfiguration overhead is high, dynamic partitioning is to be avoided. An expression for the reconfiguration overhead threshold is derived. A general iterative partitioning technique is presented. It is shown that the technique gives maximum throughput for n partitions.
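The two-partition optimization the abstract describes can be illustrated with a brute-force search over splits. The Amdahl-style execution signature below is an invented example, not the paper's model:

```python
def signature(p, serial_fraction=0.1):
    # Hypothetical execution signature: the rate at which one program runs
    # on p processors (Amdahl-style curve, assumed for illustration only).
    return p / (serial_fraction * p + (1 - serial_fraction))

def best_two_way_split(total_processors, sig):
    # Enumerate every two-partition split of the processors and return the
    # split whose summed execution rates (a simple throughput proxy) is largest.
    return max(
        ((k, total_processors - k) for k in range(1, total_processors)),
        key=lambda split: sig(split[0]) + sig(split[1]),
    )
```

With a concave signature like this one, the even split maximizes the summed rate; a signature with heavier serial overhead shifts the optimum toward unequal partitions.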

2.
RPM enables rapid prototyping of different multiprocessor architectures. It uses hardware emulation for reliable design verification and performance evaluation. The major objective of the RPM project is to develop a common, configurable hardware platform to accurately emulate different MIMD systems with up to eight execution processors. Because emulation is orders of magnitude faster than simulation, an emulator can run problems with large data sets more representative of the workloads for which the target machine is designed. Because an emulation is closer to the target implementation than an abstracted simulation, it can accomplish more reliable performance evaluation and design verification. Finally, an emulator is a real computer with its own I/O; the code running on the emulator is not instrumented. As a result, the emulator looks exactly like the target machine (to the programmer) and can run several different workloads, including code from production compilers, operating systems, databases, and software utilities.

3.
Commercial transaction processing applications are an important workload running on symmetric multiprocessor systems (SMPs). They differ dramatically from scientific, numeric-intensive, and engineering applications because they are I/O bound, and they contain more system software activities. Most SMP servers available in the market have been designed and optimized for scientific and engineering workloads. A major challenge of studying architectural effects on the performance of a commercial workload is the lack of easy access to large-scale and complex database engines running on a multiprocessor system with powerful I/O facilities. Experiments involving case studies have been shown to be highly time-consuming and expensive. In this paper, we investigate the feasibility of using queueing network models with the support of simulation to study the SMP architectural impacts on the performance of commercial workloads. We use the commercial benchmark TPC-C as the workload. A bus-based SMP machine is used as the target platform. Queueing network modeling is employed to characterize the TPC-C workload on the SMP. The system components such as processors, memory, the memory bus, I/O buses, and disks are modeled as service centers, and their effects on performance are analyzed. Simulations are conducted as well to collect the workload-specific parameters (model parameterization) and to verify the accuracy of the model. Our studies find that among disk-related parameters, the disk rotation latency affects the performance of TPC-C most significantly. Between the number of I/O buses and the number of disks, the number of I/O buses has the greatest impact on performance. This study also demonstrates that our modeling approach is feasible, cost-effective, and accurate for evaluating the performance of commercial workloads on SMPs, and it is complementary to the measurement-based experimental approaches.
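The service-center view of the system lends itself to a standard bottleneck bound from operational analysis. The sketch below is illustrative only; the service demands are invented, not TPC-C measurements:

```python
def throughput_bound(demands):
    # Asymptotic bound from operational analysis: with service demand D_i
    # (seconds per transaction) at each center, system throughput cannot
    # exceed 1 / max(D_i); the center attaining the max is the bottleneck.
    bottleneck = max(demands, key=demands.get)
    return bottleneck, 1.0 / demands[bottleneck]

# Invented per-transaction demands for the centers named in the abstract.
center, x_max = throughput_bound(
    {"cpu": 0.005, "memory_bus": 0.002, "io_bus": 0.004, "disk": 0.010}
)
```

Here the disk center limits throughput to at most 100 transactions per second, matching the abstract's finding that disk-related parameters dominate.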

4.
Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of multiprocessor file systems. The design of a high-performance multiprocessor file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of multiprocessor file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide multiprocessor file-system design.

5.
Numerical experiments were conducted to find out the extent to which a Genetic Algorithm (GA) may benefit from a multiprocessor implementation, considering, on one hand, that analyses of individual designs in a population are independent of each other so that they may be executed concurrently on separate processors, and, on the other hand, that there are some operations in a GA that cannot be so distributed. The algorithm experimented with was based on a Gaussian distribution rather than bit exchange in the GA reproductive mechanism, and the test case was a hub frame structure of up to 1080 design variables. The experimentation, engaging up to 128 processors, confirmed expectations of radical elapsed-time reductions compared with a conventional single-processor implementation. It also demonstrated that the time spent in the nondistributable parts of the algorithm and the attendant cross-processor communication may have a very detrimental effect on the efficient utilization of the multiprocessor machine and on the number of processors that can be used effectively in a concurrent manner. Three techniques were devised and tested to mitigate that effect, resulting in efficiency increasing to exceed 99 percent. Of particular interest to the user, corresponding elapsed-time compression factors approaching 128 are realized on 128 processors. Received October 18, 2000
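A minimal sketch of the Gaussian-reproduction GA variant the abstract describes, assuming a simple sphere fitness function (the paper's hub-frame analysis is not reproduced). The per-individual fitness evaluations are the independent, parallelizable part; selection is the sequential part:

```python
import random

def gaussian_ga(fitness, dim, pop_size=20, sigma=0.3, generations=60, seed=1):
    # Gaussian perturbation replaces bit exchange in reproduction, as in the
    # abstract. Fitness calls below are independent across individuals, so a
    # multiprocessor implementation can evaluate them concurrently.
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness)       # evaluations: parallelizable
        parents = scored[: pop_size // 2]       # selection: sequential part
        pop = parents + [
            [x + rng.gauss(0, sigma) for x in rng.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
    return min(pop, key=fitness)

# Minimize the sphere function as a stand-in objective.
best = gaussian_ga(lambda v: sum(x * x for x in v), dim=3)
```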

6.
The development of computing systems with large numbers of processors has been motivated primarily by the need to solve large, complex problems more quickly than is possible with uniprocessor systems. Traditionally, multiprocessor systems have been uniprogrammed, i.e., dedicated to the execution of a single set of related processes, since this approach provides the fastest response for an individual program once it begins execution. However, if the goal of a multiprocessor system is to minimize average response time or to maximize throughput, then multiprogramming must be considered. In this paper, a model of a simple multiprocessor system with a two-program workload is reviewed; the model is then applied to an Intel iPSC/2 hypercube multiprocessor with a workload consisting of parallel wavefront algorithms for solving triangular systems of linear equations. Throughputs predicted by the model are compared with throughputs obtained experimentally from an actual system. The results provide validation for the model and indicate that significant performance improvements for multiprocessor systems are possible through multiprogramming.

7.
The paper presents a performance case study of parallel jobs executing in real multi-user workloads. The study is based on a measurement-based model capable of predicting the completion time distribution of the jobs executing under real workloads. The model constructed is also capable of predicting the effects of system design changes on application performance. The model is a finite-state, discrete-time Markov model with rewards and costs associated with each state. The Markov states are defined from real measurements and represent system/workload states in which the machine has operated. The paper places special emphasis on choosing the correct number of states to represent the workload measured. Specifically, the performance of computationally bound, parallel applications executing in real workloads on an Alliant FX/80 is evaluated. The constructed model is used to evaluate scheduling policies, the performance effects of multiprogramming overhead, and the scalability of the Alliant FX/80 in real workloads. The model identifies a number of available scheduling policies that would improve the response time of parallel jobs. In addition, the model predicts that doubling the number of processors in the current configuration would only improve response time for a typical parallel application by 25%. The model recommends a different processor configuration to more fully utilize extra processors. The paper also presents empirical results which validate the model created.

8.
The speedup factor in real-time simulation of dynamic systems using multiprocessor resources depends on the architecture of the multiprocessor system, the type of interconnection between parallel processors, the numerical methods and techniques used for discretization, and the task assignment and scheduling policy. Minimizing the number of processors needed for real-time simulation requires minimizing the processor time spent on interprocessor communication together with an efficient scheduling policy. Therefore, this article presents a methodology for the real-time simulation of dynamic systems, including a new pre-emptive static assignment and scheduling policy. The advantages of applying a digital signal processor with parallel architecture, for example the TMS320C40, in real-time simulation are described. Several features necessary for efficient multiprocessor real-time simulation, such as multiple I/O channels, concurrent I/O and CPU processing, direct high-speed interprocessor communication, fast context switching, multiple busses, multiple memories, and powerful arithmetic units, are inherent to this processor. These features minimize interprocessor communication time and maximize sustained CPU performance.

9.
Disk arrays and shared-memory multiprocessors are new technologies that are rapidly becoming pervasive. They are complementary because disk arrays naturally balance the I/O workload by interleaving data across all disks while a shared-memory multiprocessor balances the processing workload across multiple processors. In this paper, we examine how disk arrays and shared-memory multiprocessors lead to an effective method for constructing database machines for general-purpose complex query processing. We show that disk arrays can lead to cost-effective storage systems if they are configured from suitably small form-factor disk drives. We introduce the storage system metric data temperature (IO/s/GByte) as a way to evaluate how well a disk configuration can sustain its workload, and we show that disk arrays can sustain the same data temperature as a more expensive mirrored-disk configuration. We use the metric to evaluate the performance of disk arrays in XPRS, an operational shared-memory multiprocessor database system being developed at the University of California, Berkeley.
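The data temperature metric is simple to compute; the comparison below uses invented disk figures (not the paper's measurements) to show why many small form-factor drives sustain a hotter workload per gigabyte than a mirrored pair of large drives:

```python
def data_temperature(io_per_sec, gbytes):
    # Data temperature as defined in the abstract: sustained I/O rate per
    # gigabyte of stored data (IO/s/GByte).
    return io_per_sec / gbytes

# Assumed figures: an array of 16 small 1 GB disks at 60 IO/s each, versus
# a mirrored pair of 8 GB disks (both copies serve reads, capacity of one).
array_temp = data_temperature(16 * 60, 16 * 1)
mirror_temp = data_temperature(2 * 60, 8)
```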

10.
Using a directed acyclic graph (dag) model of algorithms, we investigate precedence-constrained multiprocessor schedules for the n×n×n directed mesh. This cubical mesh is fundamental, representing the standard algorithm for square matrix product, as well as many other algorithms. Its completion requires at least 3n-2 multiprocessor steps. Time-minimal multiprocessor schedules that use as few processors as possible are called processor-time-minimal. For the cubical mesh, such a schedule requires at least ⌈3n²/4⌉ processors. Among such schedules, one with the minimum period (i.e., maximum throughput) is referred to as a period-processor-time-minimal schedule. The period of any processor-time-minimal schedule for the cubical mesh is at least 3n/2 steps. This lower bound is shown to be exact by constructing, for n a multiple of 6, a period-processor-time-minimal multiprocessor schedule that can be realized on a systolic array whose topology is a toroidally connected n/2×n/2×3 mesh.
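The three lower bounds quoted in the abstract are easy to tabulate for a given n; this helper simply evaluates them (for the period bound, n is assumed to be a multiple of 6, as in the construction):

```python
import math

def cubical_mesh_bounds(n):
    # Lower bounds from the abstract for the n x n x n directed mesh.
    return {
        "steps": 3 * n - 2,                      # completion time
        "processors": math.ceil(3 * n * n / 4),  # processor-time-minimal count
        "period": 3 * n // 2,                    # minimum period (n % 6 == 0)
    }
```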

11.
This paper presents the design and implementation of a parallelization framework and OpenMP runtime support in the Intel® C++ & Fortran compilers for exploiting nested parallelism in applications using OpenMP pragmas or directives. We conduct the performance evaluation of two multimedia applications parallelized with OpenMP pragmas and compiled with the Intel C++ compiler on Hyper-Threading Technology (HT) enabled multiprocessor systems. The performance results show that the multithreaded code generated by the Intel compiler achieved a speedup of up to 4.69 on 4 processors with HT enabled for five different input video sequences for the H.264 encoder workload, and a 1.28 speedup on an HT-enabled single-CPU system and 1.99 speedup on an HT-enabled dual-CPU system for the audio–visual speech recognition workload. The performance gain due to exploiting nested parallelism for leveraging Hyper-Threading Technology is up to 70% for the two multimedia workloads under different multiprocessor system configurations. These results demonstrate that hyper-threading benefits can be achieved by exploiting nested parallelism through Intel compiler and runtime system support for OpenMP programs.

12.
In this paper, we propose a new parallel clustering algorithm, named Parallel Bisecting k-means with Prediction (PBKP), for message-passing multiprocessor systems. Bisecting k-means tends to produce clusters of similar sizes, and according to our experiments, it produces clusters with smaller entropy (i.e., purer clusters) than k-means does. Our PBKP algorithm fully exploits the data-parallelism of the bisecting k-means algorithm, and adopts a prediction step to balance the workloads of multiple processors to achieve a high speedup. We implemented PBKP on a cluster of Linux workstations and analyzed its performance. Our experimental results show that the speedup of PBKP is linear with the number of processors and the number of data points. Moreover, PBKP scales up better than the parallel k-means with respect to the dimension and the desired number of clusters. This research was supported in part by AFRL/Wright Brothers Institute (WBI).
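The base algorithm that PBKP parallelizes can be sketched sequentially (in one dimension, for brevity). This is only the plain bisecting k-means loop; PBKP's contributions, data-parallel distance computation and the load-balancing prediction step, are not modeled here:

```python
import random

def two_means(points, rng, iters=10):
    # Plain 2-means (Lloyd iterations) used as the splitting step.
    a, b = rng.sample(points, 2)
    left, right = [], []
    for _ in range(iters):
        left = [p for p in points if abs(p - a) <= abs(p - b)]
        right = [p for p in points if abs(p - a) > abs(p - b)]
        if left:
            a = sum(left) / len(left)
        if right:
            b = sum(right) / len(right)
    return left, right

def bisecting_kmeans(points, k, seed=0):
    # Repeatedly split the largest cluster in two until k clusters exist.
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        left, right = two_means(clusters.pop(0), rng)
        clusters += [c for c in (left, right) if c]
    return clusters
```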

13.
Inverted file partitioning schemes in multiple disk systems
Multiple-disk I/O systems (disk arrays) have been an attractive approach to meet high-performance I/O demands in data-intensive applications such as information retrieval systems. When we partition and distribute files across multiple disks to exploit the potential for I/O parallelism, a balanced I/O workload distribution becomes important for good performance. Naturally, the performance of a parallel information retrieval system using an inverted file structure is affected by the partitioning scheme of the inverted file. In this paper, we propose two different partitioning schemes for an inverted file system for a shared-everything multiprocessor machine with multiple disks. We study the performance of these schemes by simulation under a number of workloads in which the term frequencies in the documents, the term frequencies in the queries, the number of disks, and the multiprogramming level are varied.
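Two common ways to partition an inverted file across disks can be sketched as follows. These are generic illustrations (round-robin term assignment, striping by document id), not necessarily the specific schemes the paper proposes:

```python
def partition_by_term(inverted, n_disks):
    # Term-based partitioning: each term's full posting list lives on one
    # disk (round-robin assignment here, for illustration).
    disks = [dict() for _ in range(n_disks)]
    for i, (term, postings) in enumerate(sorted(inverted.items())):
        disks[i % n_disks][term] = postings
    return disks

def partition_by_document(inverted, n_disks):
    # Document-based partitioning: every posting list is striped across
    # disks by document id, so one query term touches all disks in parallel.
    disks = [dict() for _ in range(n_disks)]
    for term, postings in inverted.items():
        for d in range(n_disks):
            part = [doc for doc in postings if doc % n_disks == d]
            if part:
                disks[d][term] = part
    return disks
```

The trade-off mirrors the abstract's concern with load balance: term partitioning keeps each query term on one disk (low overhead, skew-sensitive), while document partitioning spreads every term's I/O evenly.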

14.
The availability of multicore processors and programmable NICs, such as TOEs (TCP/IP Offloading Engines), provides new opportunities for designing efficient network interfaces to cope with the gap between the improvement rates of link bandwidths and microprocessor performance. This gap poses important challenges related to the high computational requirements associated with the traffic volumes and wider functionality that the network interface has to support. Given the rate of link bandwidth improvement and the ever-changing and increasing application demands, efficient network interface architectures require scalability and flexibility. An opportunity to reach these goals comes from the exploitation of parallelism in the communication path by distributing the protocol processing work across the processors available in the computer, i.e. multicore microprocessors and programmable NICs. Thus, after a brief review of the different solutions that have been previously proposed for speeding up network interfaces, this paper analyzes the onloading and offloading alternatives. Both strategies try to release host CPU cycles by taking advantage of the communication workload execution in other processors present in the node. Nevertheless, whereas onloading uses another general-purpose processor, either included in a chip multiprocessor (CMP) or in a symmetric multiprocessor (SMP), offloading takes advantage of processors in programmable network interface cards (NICs). From our experiments, implemented by using a full-system simulator, we provide a fair and more complete comparison between onloading and offloading.
It is shown that the relative improvement in peak throughput offered by offloading and onloading depends on the ratio of application workload to communication overhead, the message sizes, and the characteristics of the system architecture, more specifically the bandwidth of the buses and the way the NIC is connected to the system processor and memory. In our implementations, offloading provides lower latencies than onloading, although CPU utilization and interrupts are lower for onloading. Taking into account the conclusions of our experimental results, we propose a hybrid network interface that can take advantage of both programmable NICs and multicore processors.

15.
A methodology, called Subsystem Access Time (SAT) modeling, is proposed for the performance modeling and analysis of shared-bus multiprocessors. The methodology is subsystem-oriented because it is based on a Subsystem Access Time Per Instruction (SATPI) concept, in which we treat major components other than processors (e.g., off-chip cache, bus, memory, I/O) as subsystems and model for each of them the mean access time per instruction from each processor. The SAT modeling methodology is derived from the Customized Mean Value Analysis (CMVA) technique, which is request-oriented in the sense that it models the weighted total mean delay for each type of request processed in the subsystems. The subsystem-oriented view of the proposed methodology facilitates divide-and-conquer modeling and bottleneck analysis, which has rarely been addressed previously. These distinguishing features lead to a simple, general, and systematic approach to the analytical modeling and analysis of complex multiprocessor systems. To illustrate the key ideas and features that differ from CMVA, an example performance model of a particular shared-bus multiprocessor architecture is presented. The model is used to conduct performance evaluation for throughput prediction. Thereby, the SATPIs of the subsystems are directly utilized to identify the bottleneck subsystem and find the requests or subsystem components that cause the bottleneck. Furthermore, the SATPIs of the subsystems are employed to explore the impact of several performance-influencing factors, including memory latency, number of processors, data bus width, and DMA transfer.
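The SATPI composition can be sketched as a simple additive model: each subsystem contributes a mean access time per instruction, and the largest contribution flags the bottleneck. This is a schematic reading of the abstract, with invented numbers, not the paper's actual model:

```python
def time_per_instruction(cpu_tpi, satpis):
    # Additive SAT-style composition (sketch): total time per instruction is
    # the processor's own time plus the per-subsystem access times (SATPIs);
    # the largest SATPI identifies the bottleneck subsystem.
    total = cpu_tpi + sum(satpis.values())
    bottleneck = max(satpis, key=satpis.get)
    return total, bottleneck

# Invented SATPIs in cycles per instruction, for illustration only.
total, bottleneck = time_per_instruction(
    1.0, {"cache": 0.2, "bus": 0.5, "memory": 0.8}
)
```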

16.
Speedup and efficiency, two measures of the performance of pipelined computers, are now used to evaluate the performance of parallel algorithms for multiprocessor systems. However, these measures consider only the computation time and number of processors used and do not include the number of communication links in the system. The author defines two new measures, cost effectiveness and time-cost effectiveness, for evaluating the performance of a parallel algorithm for a multiprocessor system. From these two measures, two characterization factors for multiprocessor systems are defined and used to analyze some well-known multiprocessor systems. It is found that for a given penalty function, every multiprocessor architecture has an optimal number of processors that produces maximum profit. If too many processors are used, the higher cost of the system reduces the profit obtained from the faster solution. On the other hand, if too few processors are used, the penalty paid for taking a longer time to obtain the solution reduces the profit.
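The cost-versus-penalty trade-off behind the optimal processor count can be illustrated with a toy profit function. The linear speedup and unit coefficients below are assumptions for illustration, not the paper's characterization factors:

```python
def profit(p, t1, speedup, cost_per_processor, penalty_rate):
    # Toy trade-off: more processors shorten elapsed time (lower penalty)
    # but raise system cost; profit is the negated total of the two.
    elapsed = t1 / speedup(p)
    return -(cost_per_processor * p + penalty_rate * elapsed)

def optimal_processors(max_p, **kwargs):
    # Exhaustively find the processor count maximizing profit.
    return max(range(1, max_p + 1), key=lambda p: profit(p, **kwargs))
```

With linear speedup, unit processor cost, and unit time penalty on a job of 100 time units, the optimum lands where marginal cost equals marginal time saving (here, 10 processors).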

17.
Multicore processors are customary in current-generation computing systems. The overall concept of general-purpose processing, however, remains a challenge, as architects must provide increased performance for each advancing generation without relying solely on transistor scaling and additional cache levels. Although architects have steered towards heterogeneity to increase performance and efficiency for a variety of workloads, the fundamental question of how a single core's architecture may be improved and applied to the multiprocessor domain remains. This work builds upon the concept of Configurable Computing Units (CCUs), a nuanced approach to processor architectures and microarchitectures employing reconfigurable datapaths and task-based execution. This work improves upon the efficiency of CCUs by applying various new design techniques, including branch prediction, variable configuration, an OpenMP programming model, and Berkeley Dwarf testing. Experimental results using Gem5 demonstrate that a single CCU core can achieve dual-core performance, with a 1.29x decrease in area overhead and 55% of the power consumption required by a conventional CPU.

18.
This work examines scheduling for a real-time multiprocessor (MAFT) in which both hard deadlines and fault-tolerance are necessary system components. A workload for this system consists of a set of concurrent dependent tasks, each with some execution frequency; tasks are also fully ordered by priority. Fault tolerance mechanisms include hardware-supported voting on computation results as well as on task starts, task completions, and branch conditions. The distributed agreement mechanism used on system-level decisions adds a variable threading delay to the run time of each copy of a task. These delays make current schedule verification techniques inapplicable. In the most general execution profile, each processor in the system runs a subset of the tasks, with different tasks possibly having different frequencies. In this work, however, we restrict attention to a special class of workloads, termed uni-schedule, in which each processor executes the entire task set, using the multiple processors to implement full redundancy. In addition, all tasks are assumed to have the same periodicity. Given these restrictions, we produce stable schedules consistent with the initial workload specifications. Algorithms are first given for uni-schedule workloads with no run-time branches, and then for uni-schedule workloads with branches.

19.
The Cydra 5 is a heterogeneous multiprocessor system that targets small work groups or departments of scientists and engineers. The two types of processors are functionally specialized for the different components of the work load found in a departmental setting. The Cydra 5 numeric processor, based on a directed-data-flow architecture, provides consistently high performance on a broader class of numerical computations. The interactive processors offload all nonnumeric work from the numeric processor, leaving it free to spend all its time on the numeric application. The I/O processors permit high-bandwidth I/O transitions with minimal involvement from the interactive or numeric processors. The system architecture and data-flow architecture are described. The numeric processor decisions and tradeoffs are examined, and the main memory system is discussed. Some reflections on the design issues are offered.

20.
We present a new algorithm for implementing a concurrent B-tree on a multiprocessor. The algorithm replicates B-tree nodes on multiple processors (particularly nodes near the root of the tree) to eliminate bottlenecks caused by contention for a single copy of each node. In contrast to other replication or caching strategies that provide some form of coherence, the algorithm uses a novel replication strategy, called multi-version memory. Multi-version memory weakens the semantics of coherent caches by allowing readers to access "old versions" of data. As a result, readers can run in parallel with a writer. Using multi-version memory for the internal nodes of a B-tree reduces the synchronization requirements and delays associated with updating the internal nodes. Experiments comparing the B-tree algorithm based on multi-version memory to other algorithms based on coherent replication show that using multi-version memory enhances throughput significantly, even for small numbers of processors, and also allows throughput to scale with increasing numbers of processors long past the point where other algorithms saturate and start to thrash.
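The core idea of multi-version memory, readers keep using the version that was current when they started, so they never block behind a writer, can be sketched with a minimal versioned cell. This is an illustrative data structure, not the paper's B-tree implementation:

```python
class MultiVersionCell:
    # Sketch of multi-version memory: a writer installs a new version while
    # readers pinned to an earlier snapshot keep reading the "old version".
    def __init__(self, value):
        self.versions = [value]

    def write(self, value):
        self.versions.append(value)    # writer installs a new version

    def snapshot(self):
        return len(self.versions) - 1  # reader pins the current version

    def read(self, version):
        return self.versions[version]  # possibly an old (stale) version
```

A reader that took a snapshot before a write still sees the old value, which is exactly the weakened-coherence behavior that lets reads proceed concurrently with updates to internal B-tree nodes.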
