首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
2.
This is the first of a series of papers on the Genesis distributed-memory benchmarks, which were developed under the European ESPRIT research program. The benchmarks provide a standard reference Fortran77 uniprocessor version, a distributed memory. MIMD version, and in some cases a Fortran90 version suitable for SIMD computers. The problems selected all have a scientific origin (mostly from physics or theoretical chemistry), and range from synthetic code fragments designed to measure the basic hardware properties of the computer (especially communication and synchronisation overheads), through commonly used library subroutines, to full application codes. This first paper defines the methodology to be used to analyse the benchmark results, and gives an example of a fully analysed application benchmark from General Relativity (GR1). First, suitable absolute performance metrics are carefully defined, then the performance analysis treats the execution time and absolute performance as functions of at least two variables, namely the problem size and the number of proecssors. The theoretical predictions are compared with, or fitted to, the measured results, and then used to predict (with due caution) how the performance might scale for larger problems and more processors than were actually available during the benchmarking. Benchmark measurements are given primarily for the German SUPRENUM computer, but also for the IBM 3083J, Convex C210 and a Parsys Supernode with 32 T800-20 transputers.  相似文献   

3.
By using the principle of fixed-time benchmarking, it is possible to compare a wide range of computers, from a small personal computer to the most powerful parallel supercomputer, on a single scale. Fixed-time benchmarks promise greater longevity than those based on a particular problem size and are more appropriate for “grand challenge” capability comparison. We present the design of a benchmark, SLALOM, that adjusts automatically to the computing power available and corrects several deficiencies in various existing benchmarks: it is highly scalable, solves a real problem, includes input and output times, and can be run on parallel computers of all kinds, using any convenient language. The benchmark provides an estimate of the size of problem solvable on scientific computers. It also can be used to demonstrate a new source of superlinear speedup in parallel computers. Results that span six orders of magnitude for contemporary computers of various architectures are presented.  相似文献   

4.
The Genesis benchmark suite has been assembled to evaluate the performance of distributed-memory MIMD systems. The problems selected all have a scientific origin (mostly from physics or theoretical chemistry), and range from synthetic code fragments designed to measure the basic hardware properties of the computer (especially communication and synchronisation overheads), through commonly used library subroutines, to full application codes. This is the second of a series of papers on the Genesis distributed-memory benchmarks, which were developed under the European ESPRIT research program. Results are presented for the SUPRENUM and iPSC/860 computers when running the following benchmarks: COMMS1 (communications), TRANS1 (matrix transpose), FFT1 (fast Fourier transform) and QCD2 (conjugate gradient kernel). The theoretical predictions are compared with, or fitted to, the measured results, and then used to predict (with due caution) how the performance might scale for larger problems and more processors than were actually available during the benchmarking.  相似文献   

5.
Giladi  R. Ahitav  N. 《Computer》1995,28(8):33-42
Potential computer system users or buyers usually employ a computer performance evaluation technique only if they believe its results provide valuable information. System Performance Evaluation Cooperative (SPEC) measures are perceived to provide such information and are therefore the ones most commonly used. SPEC measures are designed to evaluate the performance of engineering and scientific workstations, personal vector computers, and even minicomputers and superminicomputers. Along with the Transaction Processing Council (TPC) measures for database I/O performance, they have become de facto industry standards, but do SPEC's evaluation outcomes actually provide added information value? In this article, we examine these measures by considering their structure, advantages and disadvantages. We use two criteria in our examination: are the programs used in the SPEC suite properly blended to reflect a representative mix of different applications, and are they properly synthesized so that the aggregate measures correctly rank computers by performance? We conclude that many programs in the SPEC suites are superfluous; the benchmark size can be reduced by more than 50%. The way the measure is calculated may cause distortion. Substituting the harmonic mean for the geometric mean used by SPEC roughly preserves the measure, while giving better consistency. SPEC measures reflect the performance of the CPU rather than the entire system. Therefore, they might be inaccurate in ranking an entire system. To remedy these problems, we propose a revised methodology for obtaining SPEC measures  相似文献   

6.
This paper reports the performance of a single node of the Hitachi SR8000 when using SPEC OMP2001 benchmarks. Each processing node of the SR8000 is a shared-memory parallel computer composed of eight scalar processors with pseudo-vector processing feature. We have run the all of the SPEC OMP2001 benchmarks on the SR8000. According to the results of this performance measurement, we found that the SR8000 has good scalability continuing up to 8 processors except for a few benchmark programs. The performance results demonstrate that the SR8000 achieves high performance especially for memory-intensive applications.  相似文献   

7.
Parallel computers offer a solution to improve the lengthy computation time of many conventional, sequential programs used in molecular biology. On a parallel computer, different pieces of the computation are performed simultaneously on different processors. LINKMAP is a sequential program widely used by scientists to perform genetic linkage analysis. We have converted LINKMAP to run on a parallel computer, using the machine-independent parallel programming language, Linda. Using the parallelization of LINKMAP as a case study, the paper outlines an approach to converting existing highly iterative programs to a parallel form. The paper describes the steps involved in converting the sequential program to a parallel program. It presents performance benchmarks comparing the sequential version of LINKMAP with the parallel version running on different parallel machines. The paper also discusses alternative approaches to the problem of "load balancing," making sure the computational load is shared as evenly as possible among the available processors.  相似文献   

8.
Recently, a number of advanced architecture machines have become commercially available. These new machines promise better cost performance than traditional computers, and some of them have the potential of competing with current supercomputers, such as the CRAY X-MP, in terms of maximum performance. This paper describes the methodology and results of a pilot study of the performance of a broad range of advanced architecture computers using a number of complete scientific application programs. The computers evaluated include:
  • 1 shared-memory bus architecture machines such as the Alliant FX/8, the Encore Multimax, and the Sequent Balance and Symmetry
  • 2 shared-memory network-connected machines such as the Butterfly
  • 3 distributed-memory machines such as the NCUBE, Intel and Jet Propulsion Laboratory (JPL)/Caltech hypercubes
  • 4 very long instruction word machines such as the Cydrome Cydra-5
  • 5 SIMD machines such as the Connection Machine
  • 6 ‘traditional’ supercomputers such as the CRAY X-MP, CRAY-2 and SCS-40.
Seven application codes from a number of scientific disciplines have been used in the study, although not all the codes were run on every machine. The methodology and guidelines for establishing a standard set of benchmark programs for advanced architecture computers are discussed. The CRAYs offer the best performance on the benchmark suite; the shared memory multiprocessor machines generally permitted some parallelism, and when coupled with substantial floating point capabilities (as in the Alliant FX/8 and Sequent Symmetry), provided an order of magnitude less speed than the CRAYs. Likewise, the early generation hypercubes studied here generally ran slower than the CRAYs, but permitted substantial parallelism from each of the application codes.  相似文献   

9.
Conte  T.M. Hwu  W.-M.W. 《Computer》1991,24(1):48-56
An abstract system of benchmark characteristics that makes it possible, in the beginning of the design stage, to design with benchmark performance in mind is presented. The benchmark characteristics for a set of commonly used benchmarks are then shown. The benchmark set used includes some benchmarks from the Systems Performance Evaluation Cooperative. The SPEC programs are industry-standard applications that use specific inputs. Processor, memory-system, and operating-system characteristics are addressed  相似文献   

10.
SPEC CPU2000: measuring CPU performance in the New Millennium   总被引:1,自引:0,他引:1  
Henning  J.L. 《Computer》2000,33(7):28-35
As computers and software have become more powerful, it seems almost human nature to want the biggest and fastest toy you can afford. But how do you know if your toy is tops? Even if your application never does any I/O, it's not just the speed of the CPU that dictates performance. Cache, main memory, and compilers also play a role. Software applications also have differing performance requirements. So whom do you trust to provide this information? The Standard Performance Evaluation Corporation (SPEC) is a nonprofit consortium whose members include hardware vendors, software vendors, universities, customers, and consultants. SPEC's mission is to develop technically credible and objective component- and system-level benchmarks for multiple operating systems and environments, including high-performance numeric computing, Web servers, and graphical subsystems. On 30 June 2000, SPEC retired the CPU95 benchmark suite. Its replacement is CPU2000, a new CPU benchmark suite with 19 applications that have never before been in a SPEC CPU suite. The article discusses how SPEC developed this benchmark suite and what the benchmarks do  相似文献   

11.
Benchmarks are vital tools in the performance measurement and evaluation of computer hardware and software systems. Standard benchmarks such as the TREC, TPC, SPEC, SAP, Oracle, Microsoft, IBM, Wisconsin, AS3AP, OO1, OO7, XOO7 benchmarks have been used to assess the system performance. These benchmarks are domain-specific in that they model typical applications and tie to a problem domain. Test results from these benchmarks are estimates of possible system performance for certain pre-determined problem types. When the user domain differs from the standard problem domain or when the application workload is divergent from the standard workload, they do not provide an accurate way to measure the system performance of the user problem domain. System performance of the actual problem domain in terms of data and transactions may vary significantly from the standard benchmarks. In this research, we address the issue of domain boundness and workload boundness which results in the ir-representative and ir-reproducible performance readings. We tackle the issue by proposing a domain-independent and workload-independent benchmark method which is developed from the perspective of the user requirements. We present a user-driven workload model to develop a benchmark in a process of workload requirements representation, transformation, and generation. We aim to create a more generalized and precise evaluation method which derives test suites from the actual user domain and application. The benchmark method comprises three main components. They are a high-level workload specification scheme, a translator of the scheme, and a set of generators to generate the test database and the test suite. The specification scheme is used to formalize the workload requirements. The translator is used to transform the specification. The generator is used to produce the test database and the test workload. In web search, the generic constructs are main common carriers we adopt to capture and compose the workload requirements. We determine the requirements via the analysis of literature study. In this study, we have conducted ten baseline experiments to validate the feasibility and validity of the benchmark method. An experimental prototype is built to execute these experiments. Experimental results demonstrate that the method is capable of modeling the standard benchmarks as well as more general benchmark requirements.  相似文献   

12.
Alexander  W.G. Wortman  D.B. 《Computer》1975,8(11):41-46
One use of performance measurement techniques is in the study of operational characteristics of programs written in high-level programming languages. Information derived from such studies can be used to construct benchmark programs and synthetic workloads,1,2detect inefficiencies in programming language implementation, and suggest possible improvements in the design of computers.3,9,10Our main interest is in the latter area: the discovery of primitive operations, implied by the semantics of a programming language, that can be added to the firmware or hardware of a computer to improve overall system performance. These computer architecture optimization techniques have been applied in several studies3,9and have been used commercially to design efficient pseudo machines for the Burroughs B1700.10,12  相似文献   

13.
Transactional memory (TM) is an approach to concurrency control that aims to make writing parallel programs both effective and simple. The approach has been initially proposed for nondistributed multiprocessor systems, but it is gaining popularity in distributed systems to synchronize tasks at large scales. Efficiency and scalability are often the key issues in TM research; thus, performance benchmarks are an important part of it. However, while standard TM benchmarks like the Stanford Transactional Applications for Multi‐Processing suite and STMBench7 are available and widely accepted, they do not translate well into distributed systems. Hence, the set of benchmarks usable with distributed TM systems is very limited, and must be padded with microbenchmarks, whose simplicity and artificial nature often makes them uninformative or misleading. Therefore, this paper introduces Helenos, a realistic, complex, and comprehensive distributed TM benchmark based on the problem of the Facebook inbox, an application of the Cassandra distributed store.  相似文献   

14.
NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available with different parallel programming models beyond the original versions with OpenMP and MPI. This work joins these research efforts by providing a new CUDA implementation for NPB. Our contribution covers different aspects beyond the implementation. First, we define design principles based on the best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide ease of use parametrization support for configuring the number of threads per block in our version. Third, we conduct a broad study on the impact of the number of threads per block in the benchmarks. Fourth, we propose and evaluate five strategies for helping to find a better number of threads per block configuration. The results have revealed relevant performance improvement solely by changing the number of threads per block, showing performance improvements from 8% up to 717% among the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, code refactoring required, and parallelism implementations. The performance results have shown up to 267% improvements over the best benchmarks versions available. We also observe the best and worst design choices, concerning code size and the performance trade-off. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how the computations impact the GPU's behavior.  相似文献   

15.
Given the increased availability of general purpose parallel computers two issues arise: One needs to compare the performance of the different available platforms using realistic examples, and it is necessary to write application software that can be ported easily in order to take advantage of different platforms. The authors address these issues from an applications point of view. They are interested in the use of general purpose parallel computers for simulation tasks needed during the design of very large scale integrated (VLSI) circuits. They characterize the simulation task as a useful benchmark and introduce a high level process view of parallel simulation that is helpful for deriving portable parallel programs. Details of the partitioning strategy and the simulation algorithm used in the application are given. They discuss their implementation on different parallel machines and give statistics of various experiments  相似文献   

16.
Benchmarks are heavily used in different areas of computer science to evaluate algorithms and tools. In program analysis and testing, open‐source and commercial programs are routinely used as benchmarks to evaluate different aspects of algorithms and tools. Unfortunately, many of these programs are written by programmers who introduce different biases, not to mention that it is very difficult to find programs that can serve as benchmarks with high reproducibility of results. We propose a novel approach for generating random benchmarks for evaluating program analysis and testing tools and compilers. Our approach uses stochastic parse trees, where language grammar production rules are assigned probabilities that specify the frequencies with which instantiations of these rules will appear in the generated programs. We implemented our tool for Java and applied it to generate a set of large benchmark programs of up to 5M lines of code each with which we evaluated different program analysis and testing tools and compilers. The generated benchmarks let us independently rediscover several issues in the evaluated tools. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   

17.
Jones  C. 《Computer》1995,28(10):102-103
In software, “benchmarking” usually compares two companies' practices and results, but, occasionally, it involves sets of companies. For example, there are benchmark comparisons of industry software, such as insurance software, military software, telecommunication software, commercial software, and the like. In other domains, “benchmark” usually means the collection of a substantial body of quantitative data. Benchmark comparisons of various computers, for example, rate their relative performance in at least half a dozen categories. Historically, software benchmarks have been qualitative rather than quantitative. Even the Software Engineering Institute's Capability Maturity Model (SEI CMM) is essentially a qualitative benchmark that ranks company performance on a five-point scale that lacks quantification for specific quality and productivity levels  相似文献   

18.
This paper gives an overview of two related tools that we have developed to provide more accurate measurement and modelling of the performance of message-passing communication and application programs on distributed memory parallel computers. MPIBench uses a very precise, globally synchronised clock to measure the performance of MPI communication routines. It can generate probability distributions of communication times, not just the average values produced by other MPI benchmarks. This allows useful insights to be made into the MPI communication performance of parallel computers, and in particular how performance is affected by network contention. The Performance Evaluating Virtual Parallel Machine (PEVPM) provides a simple, fast and accurate technique for modelling and predicting the performance of message-passing parallel programs. It uses a virtual parallel machine to simulate the execution of the parallel program. The effects of network contention can be accurately modelled by sampling from the probability distributions generated by MPIBench. These tools are particularly useful on clusters with commodity Ethernet networks, where relatively high latencies, network congestion and TCP problems can significantly affect communication performance, which is difficult to model accurately using other tools. Experiments with example parallel programs demonstrate that PEVPM gives accurate performance predictions on commodity clusters. We also show that modelling communication performance using average times rather than sampling from probability distributions can give misleading results, particularly for programs running on a large number of processors.  相似文献   

19.
Moving data between processes has often been discussed as one of the major bottlenecks in parallel computing—there is a large body of research, striving to improve communication latency and bandwidth on different networks, measured with ping-pong benchmarks of different message sizes. In practice, the data to be communicated generally originates from application data structures and needs to be serialized before communicating it over serial network channels. This serialization is often done by explicitly copying the data to communication buffers. The message passing interface (MPI) standard defines derived datatypes to allow zero-copy formulations of non-contiguous data access patterns. However, many applications still choose to implement manual pack/unpack loops, partly because they are more efficient than some MPI implementations. MPI implementers on the other hand do not have good benchmarks that represent important application access patterns. We demonstrate that the data serialization can consume up to 80 % of the total communication overhead for important applications. This indicates that most of the current research on optimizing serial network transfer times may be targeted at the smaller fraction of the communication overhead. To support the scientific community, we extracted the send/recv-buffer access patterns of a representative set of scientific applications to build a benchmark that includes serialization and communication of application data and thus reflects all communication overheads. This can be used like traditional ping-pong benchmarks to determine the holistic communication latency and bandwidth as observed by an application. It supports serialization loops in C and Fortran as well as MPI datatypes for representative application access patterns. Our benchmark, consisting of seven micro-applications, unveils significant performance discrepancies between the MPI datatype implementations of state of the art MPI implementations. Our micro-applications aim to provide a standard benchmark for MPI datatype implementations to guide optimizations similarly to the established benchmarks SPEC CPU and Livermore Loops.  相似文献   

20.
基于图形处理器(graphics processing unit, GPU)加速设备的高性能计算机已经成为目前高性能计算领域的一个重要发展趋势.然而,在当前的GPU设备上开发高效的并行程序仍然是一件非常复杂的事情.针对这一问题,1)总结了影响GPU程序性能的5类关键性能指标;2)采用NVIDIA公司提供的CUPTI底层接口,设计并实现了一套GPU程序性能分析工具集,该工具集可以有效地分析GPU程序的性能行为;3)采用该工具集对著名的GPU评测程序集Rodinia中的17个程序和一个真实应用程序进行了负载特征分析.总结出常见性能瓶颈的典型原因,并给出一些开发高效GPU程序的建议.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号