Similar Documents
20 similar documents retrieved (search time: 31 ms)
1.
Cache coherence in shared-memory multiprocessor systems has been studied mostly from an architecture viewpoint, often by means of aggregate metrics. In many cases, aggregate events provide insufficient information for programmers to understand and optimize the coherence behavior of their applications. A better understanding would be given by source code correlations of not only aggregate events, but also finer-granularity metrics directly linked to high-level source code constructs, such as source lines and data structures. In this paper, we explore a novel application-centric approach to studying coherence traffic. We develop a coherence analysis framework based on incremental coherence simulation of actual reference traces. We provide tool support to extract these reference traces and synchronization information from OpenMP threads at runtime using dynamic binary rewriting of the application executable. These traces are fed to ccSIM, our cache-coherence simulator. The novelty of ccSIM lies in its ability to relate low-level cache coherence metrics (such as coherence misses and their causative invalidations) to high-level source code constructs, including source code locations and data structures. We explore the degree of freedom in interleaving data traces from different processors and assess simulation accuracy in comparison to metrics obtained from hardware performance counters. Our quantitative results show that: 1) cache coherence traffic can be simulated with a considerable degree of accuracy for SPMD programs, as the invalidation traffic closely matches the corresponding hardware performance counters; 2) detailed, high-level coherence statistics are very useful in detecting, isolating, and understanding coherence bottlenecks. We use ccSIM with several well-known benchmarks and find coherence optimization opportunities leading to significant reductions in coherence traffic and savings in wall-clock execution time.
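To make the attribution idea concrete, here is a minimal sketch (MSI-style; the trace format, names, and data are invented for illustration, not taken from the paper) of the kind of trace-driven bookkeeping a simulator like ccSIM performs: replaying a merged reference trace while attributing invalidations, and the coherence misses they later cause, back to source lines.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <vector>

// One memory reference from a per-thread trace (format is illustrative).
struct Ref { int cpu; bool write; uint64_t line; int srcLine; };

int main() {
    // Tiny hand-made trace: CPU 0 writes a line, CPU 1 reads it, CPU 0
    // writes again (invalidating CPU 1), CPU 1 re-reads (coherence miss).
    std::vector<Ref> trace = {
        {0, true, 0x40, 10}, {1, false, 0x40, 22},
        {0, true, 0x40, 10}, {1, false, 0x40, 22},
    };
    std::map<uint64_t, std::set<int>> sharers;      // cache line -> CPUs holding a copy
    std::map<uint64_t, std::set<int>> invalidated;  // cache line -> CPUs whose copy was invalidated
    std::map<int, long> invBySrc, cohMissBySrc;     // per-source-line attribution
    for (const Ref& r : trace) {
        auto& s = sharers[r.line];
        auto& inv = invalidated[r.line];
        if (!s.count(r.cpu) && inv.count(r.cpu)) {  // miss caused by an earlier invalidation
            cohMissBySrc[r.srcLine]++;
            inv.erase(r.cpu);
        }
        if (r.write)                                // a write invalidates all other sharers
            for (auto it = s.begin(); it != s.end();) {
                if (*it != r.cpu) { inv.insert(*it); invBySrc[r.srcLine]++; it = s.erase(it); }
                else ++it;
            }
        s.insert(r.cpu);
    }
    for (auto& [src, n] : invBySrc)
        std::cout << "source line " << src << " caused " << n << " invalidation(s)\n";
    for (auto& [src, n] : cohMissBySrc)
        std::cout << "source line " << src << " suffered " << n << " coherence miss(es)\n";
}
```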

2.
As multi-core processors become the new direction of processor development, applications must execute in parallel for program performance to keep improving. Traditional automatic parallelization handles the regular loops of scientific computing well, but it cannot yet effectively parallelize irregular programs that contain many function calls and pointer references. To address this, the paper proposes a method that identifies potentially parallel regions from average region execution time and data-dependence information, enabling efficient parallelization of such irregular programs. The main contributions are: (1) automatic recognition of several kinds of parallelism in a program, covering not only the fine-grained parallelism between loop iterations handled by traditional analysis but also the coarse-grained parallelism between loop bodies and function call sites that traditional analysis cannot yet exploit; among the many parallel opportunities, a profitability analysis based on average region execution time selects the regions worth parallelizing; (2) automatic identification of the number and types of data dependences between potentially parallel regions, along with the program variables that cause them. Based on this analysis, the authors parallelized four benchmarks from SPEC2006 using a behavior-oriented speculative parallelization system (behavior oriented parallelism), achieving average speedups of 300% on Intel and 260% on AMD multi-core processors.
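As a hedged illustration of the profitability analysis described above, the sketch below keeps only candidate regions whose average execution time is large enough to amortize speculation overhead and whose measured cross-region dependences are absent; all names, fields, and thresholds are invented for illustration, not taken from the paper.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Profile data for one candidate region (all fields illustrative).
struct Region {
    std::string name;
    double avgExecMs;   // average execution time per dynamic instance
    int    flowDeps;    // measured flow dependences with other instances
};

// Hypothetical profitability test: speculate only when the region is long
// enough to amortize startup/commit overhead and is dependence-free.
bool worthParallelizing(const Region& r, double overheadMs = 0.5) {
    return r.avgExecMs > 10 * overheadMs && r.flowDeps == 0;
}

int main() {
    std::vector<Region> candidates = {
        {"loop_body_L120", 42.0, 0},   // coarse-grained and independent -> pick
        {"call_site_L88",   0.3, 0},   // too short to amortize overhead
        {"loop_body_L200", 55.0, 3},   // carries dependences -> reject
    };
    for (const auto& r : candidates)
        std::cout << r.name << (worthParallelizing(r) ? ": parallelize\n" : ": keep serial\n");
}
```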

3.
Computer servers deployed today largely use two 64-bit processors: the AMD Opteron and the Intel Xeon. Opteron and Xeon differ in clock frequency, memory controller, I/O interconnect, and other respects, so cluster systems built on the two processors have distinct characteristics, and their performance depends closely on the specific application. When building a cluster for high-performance scientific computing, which 64-bit processor makes the more reasonable basis is an important question for many users. To answer it, this paper reports a series of scientific-computing performance tests and comparisons on cluster systems based on the AMD Opteron 252 (2.6 GHz) and the Intel Xeon 3.6 GHz (L2 cache: 1 MB).

4.
Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores have attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU–GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect result. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums into a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms.
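For readers unfamiliar with the data structure, the sketch below shows the CSR format and a plain serial SpMV in which each row is one segment of a segmented sum. This is only the serial baseline that the paper's speculative GPU-plus-CPU scheme accelerates, with hand-made example data.

```cpp
#include <iostream>
#include <vector>

// y = A*x with A in compressed sparse row (CSR) form.
// rowPtr[i]..rowPtr[i+1] delimit row i's entries in colIdx/val.
std::vector<double> spmvCSR(const std::vector<int>& rowPtr,
                            const std::vector<int>& colIdx,
                            const std::vector<double>& val,
                            const std::vector<double>& x) {
    int n = (int)rowPtr.size() - 1;
    std::vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)              // segmented sum: each row is one segment
        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j)
            y[i] += val[j] * x[colIdx[j]];
    return y;
}

int main() {
    // 3x3 matrix [[4,0,1],[0,2,0],[3,0,5]]
    std::vector<int>    rowPtr = {0, 2, 3, 5};
    std::vector<int>    colIdx = {0, 2, 1, 0, 2};
    std::vector<double> val    = {4, 1, 2, 3, 5};
    std::vector<double> x      = {1, 1, 1};
    for (double v : spmvCSR(rowPtr, colIdx, val, x)) std::cout << v << ' ';  // 5 2 8
}
```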

5.
The IMCI instruction set of the Intel Xeon Phi coprocessor introduces a hardware vgather instruction, intended to let 512-bit SIMD registers load data from non-contiguous memory addresses. Experimental results show, however, that vgather is likely to become one of the critical performance bottlenecks of applications on the Xeon Phi. A performance model for vgather can therefore help users grasp the coprocessor's performance characteristics in depth. Methodologically, instead of the existing practice of embedding assembly code into program segments to gather statistics, this work uses profiling tools such as PAPI to collect hardware-counter statistics directly as the model's experimental data. The model is built from the number of AGI events and the average vgather latency estimated from the VPU_DATA_READ count, and it predicts the total delay that vgather causes in Xeon Phi application code. To validate the prediction, the model was applied to a 3D 7-point stencil code; it predicted that vgather accounts for roughly 40% of total compute time, which matched the compute time measured after removing vgather with intrinsics. The work thus builds a hardware-counter-based performance model for vgather on the Xeon Phi, and a comparison with other platforms suggests the model also applies to Intel CPU platforms that provide vgather.
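A hedged sketch of the measurement-plus-model flow the abstract describes: collect the relevant native counters through PAPI around the kernel, then estimate the total vgather delay as (AGI events) × (average latency derived from VPU_DATA_READ). The event names follow the abstract but native counter names are platform-specific, the calibration constant is an assumed placeholder, and error checks are omitted for brevity.

```cpp
#include <cstdio>
#include <papi.h>   // PAPI performance-counter library

// Placeholder for the code under measurement.
static void stencil_kernel() { /* the 3D 7-point stencil would run here */ }

int main() {
    long long counts[2] = {0, 0};
    int evset = PAPI_NULL;
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&evset);
    // Native event names follow the abstract; actual names depend on the platform.
    char agiName[] = "AGI", vpuName[] = "VPU_DATA_READ";
    int agi = 0, vpu = 0;
    PAPI_event_name_to_code(agiName, &agi);
    PAPI_event_name_to_code(vpuName, &vpu);
    PAPI_add_event(evset, agi);
    PAPI_add_event(evset, vpu);
    PAPI_start(evset);
    stencil_kernel();
    PAPI_stop(evset, counts);
    // Model from the abstract: total vgather delay ~= (#AGI events) x
    // (average latency per event, estimated from the VPU_DATA_READ count).
    // The factor 4.0 is an assumed calibration constant, not from the paper.
    double avgLatency = 4.0 * counts[1] / (counts[0] + 1.0);
    std::printf("predicted vgather delay: %.0f cycles\n", counts[0] * avgLatency);
}
```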

6.
Scheduling large-scale applications in heterogeneous distributed computing systems is a fundamental NP-complete problem that is critical to obtaining good performance and execution cost. In this paper, we address the scheduling problem of an important class of large-scale Grid applications inspired by the real world, characterized by a huge number of homogeneous, concurrent, and computationally intensive tasks that are the main sources of performance, cost, and storage bottlenecks. We propose a new formulation of this problem based on a cooperative distributed game-theory-based method applied using three algorithms with low time complexity for optimizing three important metrics in scientific computing: execution time, economic cost, and storage requirements. We present comprehensive experiments using simulation and real-world applications that demonstrate the effectiveness of our approach in terms of time and fairness compared to other related algorithms.
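The paper's game-theoretic algorithms are not specified in the abstract; as a deliberately simplified stand-in, the sketch below allocates a pool of homogeneous tasks across sites in proportion to their speed so that completion times equalize (the execution-time metric only), ignoring the cost and storage dimensions a cooperative scheduler would also balance. All names and numbers are illustrative.

```cpp
#include <iostream>
#include <vector>

// One resource in a heterogeneous system (fields illustrative; the cost
// dimension is kept only to show what a fuller model would also weigh).
struct Site { double speed; double costPerTask; int assigned = 0; };

// Spread T homogeneous tasks so per-site completion times equalize
// (a makespan-oriented allocation, not the paper's game-theoretic method).
void allocate(std::vector<Site>& sites, int tasks) {
    double totalSpeed = 0;
    for (auto& s : sites) totalSpeed += s.speed;
    int placed = 0;
    for (auto& s : sites) {
        s.assigned = (int)(tasks * s.speed / totalSpeed);
        placed += s.assigned;
    }
    for (int i = 0; placed < tasks; ++i, ++placed)   // distribute rounding leftovers
        sites[i % sites.size()].assigned++;
}

int main() {
    std::vector<Site> sites = {{2.0, 1.0}, {1.0, 0.5}, {1.0, 0.2}};
    allocate(sites, 100);
    for (auto& s : sites)
        std::cout << s.assigned << " tasks, finish in " << s.assigned / s.speed << " units\n";
}
```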

7.
A hardware-counter-based tool for program performance measurement and analysis
We developed a tool, PTracker, on Intel P6-family processors under Microsoft Windows NT. It uses the processors' hardware performance counters to collect program performance data and analyzes the data together with machine architecture parameters. It requires no user programming, is independent of the programming language of the application, and is convenient to use. It not only obtains precise performance parameters from the performance counters, but also reveals high-level performance characteristics of a program by analyzing the measured data, offering guidance for performance evaluation and optimization. This paper describes PTracker's technical background, design, and implementation, and presents an application example.
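PTracker itself programs the P6 model-specific performance counters; as a minimal modern analogue of counter-based measurement, the sketch below times a code section with the x86 time-stamp counter via the __rdtsc intrinsic (GCC/Clang, <x86intrin.h>). Reading the programmable event counters (RDPMC) requires OS support and is omitted here.

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc on GCC/Clang (x86 only)

int main() {
    volatile double acc = 0.0;               // volatile: keep the loop from being optimized away
    uint64_t start = __rdtsc();              // read the time-stamp counter
    for (int i = 1; i <= 1000000; ++i)       // section under measurement
        acc += 1.0 / i;
    uint64_t cycles = __rdtsc() - start;     // elapsed reference cycles
    std::printf("~%llu cycles, acc=%f\n", (unsigned long long)cycles, (double)acc);
}
```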

8.
Real-time image recognition is increasingly required for embedded applications such as automotive systems, robotics, and entertainment. To realize real-time image recognition on such systems, libraries optimized for embedded processors are needed. OpenCV is one of the most widely used libraries for computer vision applications and has many functions optimized for Intel processors, but none optimized for embedded processors. We present a parallel implementation of the OpenCV library on the Cell Broadband Engine (Cell), one of the most widely used high-performance embedded processors. Experimental results show that most of the functions optimized for the Cell processor are faster than the corresponding functions optimized for an Intel Core 2 Duo E6850 at 3.00 GHz.

9.
As information processing applications take greater roles in our everyday life, database management systems (DBMSs) are growing in importance. DBMSs have traditionally exhibited poor cache performance and large memory footprints, therefore performing at only a fraction of their ideal execution and exhibiting low processor utilization. Previous research has studied the memory system of DBMSs on research-based simultaneous multithreading (SMT) processors. Recently, several differences have been noted between the real hyper-threaded architecture implemented by the Intel Pentium 4 and the earlier SMT research architectures. This paper characterizes the performance of a prototype open-source DBMS running TPC-equivalent benchmark queries on an Intel Pentium 4 Hyper-Threading processor. We use hardware counters provided by the Pentium 4 to evaluate the micro-architecture and study the memory system behavior of each query running on the DBMS. Our results show a speedup of up to 1.16 on TPC-C-equivalent queries and 1.26 on TPC-H-equivalent queries due to hyper-threading.

10.
The complexity of modern processors poses increasingly difficult challenges to software optimization. Modern optimizing compilers have become essential tools for leveraging the power of recent processors by means of high-level optimizations to exploit multi-core platforms and single-instruction-multiple-data (SIMD) instructions, as well as advanced code generation to deal with microarchitectural performance aspects. Using the Intel® Core™ 2 Duo processor and Intel Fortran/C++ compiler as a case study, this paper gives a detailed account of the sort of optimizations required to obtain high performance on modern processors.
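As one concrete instance of the SIMD-oriented optimization such compilers perform, the loop below has the shape auto-vectorizers handle best: unit stride, no aliasing between input and output, no loop-carried dependence. The OpenMP `simd` pragma is a standard hint added here for illustration, not something taken from the paper.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// A SAXPY-style loop: unit-stride, dependence-free, easily vectorized.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    #pragma omp simd
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += a * x[i];
}

int main() {
    std::vector<float> x(8, 1.0f), y(8, 2.0f);
    saxpy(3.0f, x, y);
    std::cout << y[0] << '\n';   // 5
}
```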

11.
Multi-core processors are a shift of paradigm in computer architecture that promises a dramatic increase in performance. But they also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges involved in designing a Breadth-First Search (BFS) algorithm for the Cell/B.E. processor. The proposed methodology combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with an implementation that embeds processor-specific optimizations. Using a fine-grained global coordination strategy derived from the Bulk-Synchronous Parallel (BSP) model, we have determined an accurate performance model that has guided the implementation and the optimization of our algorithm. Our experiments on a pre-production Cell/B.E. board running at 3.2 GHz show almost linear speedups when using multiple synergistic processing elements, and an impressive level of performance when compared to other processors. On graphs which offer sufficient parallelism, the Cell/B.E. is typically an order of magnitude faster than conventional processors, such as the AMD Opteron and the Intel Pentium 4 and Woodcrest, and custom-designed architectures, such as the MTA-2 and BlueGene/L.
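The BSP structure referred to above is level-synchronous BFS: expand the whole current frontier, then synchronize globally before the next level. The serial sketch below shows that structure (the frontier swap marks where a parallel version would place its barrier); it is an illustration, not the Cell-specific implementation.

```cpp
#include <iostream>
#include <vector>

// Level-synchronous BFS: process the whole frontier, then advance --
// the BSP-style superstep structure the paper's design follows.
std::vector<int> bfs(const std::vector<std::vector<int>>& adj, int src) {
    std::vector<int> dist(adj.size(), -1);
    std::vector<int> frontier = {src};
    dist[src] = 0;
    for (int level = 1; !frontier.empty(); ++level) {
        std::vector<int> next;               // next frontier (one superstep)
        for (int u : frontier)
            for (int v : adj[u])
                if (dist[v] < 0) { dist[v] = level; next.push_back(v); }
        frontier.swap(next);                 // barrier point in a parallel version
    }
    return dist;
}

int main() {
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2}};
    for (int d : bfs(adj, 0)) std::cout << d << ' ';   // 0 1 1 2
}
```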

12.
The Sugon exascale prototype is one of three prototype systems under China's 13th Five-Year Plan. It adopts a heterogeneous computing architecture, with CPU and accelerator based on the domestic Hygon processor architecture licensed from AMD. Beyond running standard benchmarks on the chips, to explore how a real application performs on the prototype, the laser-plasma application GTC-P was ported to it; its performance on the Hygon CPU and DCU was compared with that on an Intel 6148 CPU and an NVIDIA V100 GPU, and scalability was analyzed across multiple nodes of the prototype. This evaluation reflects the actual performance of high-performance computing applications on the Sugon exascale prototype.

13.
Automatic performance debugging of parallel applications includes two main steps: locating performance bottlenecks and uncovering their root causes for performance optimization. Previous work fails to resolve this challenging issue in two ways: first, several previous efforts automate locating bottlenecks, but present results in a confined way that only identifies performance problems with a priori knowledge; second, several tools take exploratory or confirmatory data analysis to automatically discover relevant performance data relationships, but these efforts do not focus on locating performance bottlenecks or uncovering their root causes. The single program, multiple data (SPMD) programming model is widely used for both high performance computing and Cloud computing. In this paper, we design and implement an innovative system, AutoAnalyzer, that automates the process of debugging performance problems of SPMD-style parallel programs, including data collection, performance behavior analysis, locating bottlenecks, and uncovering their root causes. AutoAnalyzer is unique in terms of two features: first, without any prior knowledge, it automatically locates bottlenecks and uncovers their root causes for performance optimization; second, it is lightweight in terms of the size of performance data to be collected and analyzed. Our contributions are three-fold: first, we propose two effective clustering algorithms to investigate the existence of performance bottlenecks that cause process behavior dissimilarity or code region behavior disparity, respectively; meanwhile, we present two search algorithms to locate bottlenecks; second, on the basis of rough set theory, we propose an innovative approach to automatically uncover root causes of bottlenecks; third, on cluster systems with two different configurations, we use two production applications, written in Fortran 77, and one open source code—MPIBZIP2 (http://compression.ca/mpibzip2/), written in C++, to verify the effectiveness and correctness of our methods. For the three applications, we also propose an experimental approach to investigating the effects of different metrics on locating bottlenecks.
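To illustrate the "process behavior dissimilarity" signal that AutoAnalyzer clusters on, the sketch below compares per-process vectors of time spent in each code region and flags processes far from the average profile. The data, threshold, and distance measure are invented for illustration; the paper's actual clustering algorithms are more elaborate.

```cpp
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    // Per-process timing profiles: time spent in each code region (made-up data).
    std::vector<std::vector<double>> profiles = {
        {1.0, 2.0, 0.5}, {1.1, 2.1, 0.5}, {1.0, 2.0, 0.6}, {4.0, 2.0, 0.5},  // last one lags
    };
    std::size_t nProc = profiles.size(), nRegion = profiles[0].size();
    std::vector<double> mean(nRegion, 0.0);          // average profile across processes
    for (auto& p : profiles)
        for (std::size_t r = 0; r < nRegion; ++r) mean[r] += p[r] / nProc;
    for (std::size_t i = 0; i < nProc; ++i) {
        double d = 0;                                // Euclidean distance from the mean
        for (std::size_t r = 0; r < nRegion; ++r)
            d += (profiles[i][r] - mean[r]) * (profiles[i][r] - mean[r]);
        d = std::sqrt(d);
        std::cout << "process " << i << ": distance " << d
                  << (d > 1.0 ? "  <- dissimilar, inspect\n" : "\n");  // threshold is illustrative
    }
}
```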

14.
Computer, 2007, 40(4): 18-20
A new chipmaking approach marks the biggest change in transistor technology since the introduction of the polysilicon-gate, metal-oxide-semiconductor process. It could overcome the most substantial roadblock to shrinking transistor elements, and thereby increasing chip performance, while continuing to use current production processes. AMD, IBM, and Intel have announced plans to replace the silicon dioxide insulator layer of processors with new hafnium-based high-k materials, which increase charge transmission and reduce electrical leakage. To accommodate this, they replace the doped polysilicon used in transistor gates with a combination of metals. This solves a key problem: building transistors smaller horizontally so that manufacturers can pack more of them onto chips while increasing energy efficiency and continuing to use current chipmaking techniques, thereby avoiding expensive fabrication-plant changes. Many real advances are likely to come not from new materials but from the development of better multicore chips, which improve performance by having multiple cores run tasks in parallel rather than by increasing clock speed, as well as from the optimization of applications for these processors.

15.
Computers & Fluids, 2006, 35(8-9): 910-919
This report presents a comprehensive survey of the effect of different data layouts on the single-processor performance characteristics of the lattice Boltzmann method, both for commodity "off-the-shelf" (COTS) architectures and tailored HPC systems, such as vector computers. We cover modern 64-bit processors ranging from IA32-compatible (Intel Xeon/Nocona, AMD Opteron), superscalar RISC (IBM Power4), and IA64 (Intel Itanium 2) to classical vector (NEC SX6+) and novel vector (Cray X1) architectures. Combining different data layouts with architecture-dependent optimization strategies, we demonstrate that the optimal implementation strongly depends on the architecture used. In particular, the correct choice of the data layout could supersede complex cache-blocking techniques in our kernels. Furthermore, our results demonstrate that vector systems can outperform COTS architectures by one order of magnitude.
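The data-layout question studied here is, at its core, array-of-structures versus structure-of-arrays. The sketch below contrasts the two for a D3Q19-style lattice (19 distributions per cell, assumed for illustration), with the collision step reduced to a simple scaling so that only the memory access pattern differs between the variants.

```cpp
#include <cstddef>
#include <vector>

constexpr int Q = 19;   // D3Q19 lattice Boltzmann: 19 distributions per cell (assumed)

// Array-of-structures (AoS): all 19 values of one cell are contiguous.
struct CellAoS { double f[Q]; };
using LatticeAoS = std::vector<CellAoS>;

// Structure-of-arrays (SoA): the i-th distribution of all cells is contiguous,
// giving unit-stride, vectorizable access when streaming one direction.
struct LatticeSoA {
    std::vector<double> f[Q];
    explicit LatticeSoA(std::size_t cells) { for (auto& a : f) a.resize(cells); }
};

// Same (simplified) update in both layouts; only the access pattern differs.
void relaxAoS(LatticeAoS& lat, double omega) {
    for (auto& c : lat)
        for (int i = 0; i < Q; ++i) c.f[i] *= (1.0 - omega);   // strided per direction
}
void relaxSoA(LatticeSoA& lat, double omega, std::size_t cells) {
    for (int i = 0; i < Q; ++i)
        for (std::size_t c = 0; c < cells; ++c) lat.f[i][c] *= (1.0 - omega);  // unit stride
}

int main() {
    LatticeAoS a(1000);
    LatticeSoA s(1000);
    relaxAoS(a, 0.6);
    relaxSoA(s, 0.6, 1000);
}
```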

16.
Oberman, S., Favor, G., Weber, F. IEEE Micro, 1999, 19(2): 37-48
The AMD-K6-2 microprocessor is the first implementation of AMD 3DNow!, a technology innovation for the x86 architecture that drives today's personal computers. 3DNow! technology is a set of 21 new instructions designed to open up the traditional processing bottlenecks for floating-point-intensive and multimedia applications. Using these instructions, applications can implement more powerful solutions to create a more entertaining and productive PC platform. Examples of the type of improvements that 3DNow! technology enables are faster frame rates on high-resolution scenes, better physical modeling of real-world environments, sharper and more detailed 3D imaging, smoother video playback, and near theater-quality audio. Future AMD processors such as the AMD-K7, designed to operate at frequencies greater than 500 MHz, should provide even higher-performance implementations of 3DNow! technology.

17.
Today's massively parallel machines are typically message-passing systems consisting of hundreds or thousands of processors. Implementing parallel applications efficiently in this environment is a challenging task, and poor parallel design decisions can be expensive to correct. Tools and techniques that allow the fast and accurate evaluation of different parallelization strategies would significantly improve the productivity of application developers and increase throughput on parallel architectures. This paper investigates one of the major issues in building tools to compare parallelization strategies: determining what type of performance models of the application code and of the computer system are sufficient for a fast and accurate comparison of different strategies. The paper is built around a case study employing the performance prediction tool (PerPreT) to predict performance of the parallel spectral transform shallow water model code (PSTSWM) on the Intel Paragon. PSTSWM is a parallel application code that was designed to evaluate different parallel strategies for the spectral transform method as it is used in climate modeling and weather forecasting. Multiple parallel algorithms and algorithm variants are embedded in the code. PerPreT uses a relatively simple algebraic model to predict execution time for SPMD (single program multiple data) parallel applications. Applications are modeled through parameterized formulae for communication and computation, where the parameters include the problem size, the number of processors used to execute the program, and system characteristics (e.g. setup times for communication, link bandwidth and sustained computing performance per processor). In this paper we describe performance models that predict the performance of the different algorithms in PSTSWM accurately enough to allow them to be compared, establishing the feasibility of such a demanding application of performance modeling. We also discuss issues in generating and validating the performance models, emphasizing the practical importance of tools such as PerPreT in such studies. © 1998 John Wiley & Sons, Ltd.
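A minimal sketch of the kind of algebraic model PerPreT evaluates: execution time as parameterized computation plus communication terms in problem size N, processor count P, and system constants (per-message setup time, link bandwidth, sustained per-processor rate). Every formula and constant below is an invented example, not the PSTSWM model from the paper.

```cpp
#include <iostream>

// System characteristics (values illustrative).
struct System {
    double flopsPerProc;   // sustained per-processor compute rate
    double setupSec;       // per-message setup (latency) cost
    double bytesPerSec;    // link bandwidth
};

// Hypothetical SPMD model: W(N) flops split over P processors, plus
// M(P) messages of B(N,P) bytes each. T(P) = comp + comm.
double predictSeconds(double N, int P, const System& s) {
    double work  = 8.0 * N * N * N;                  // assumed O(N^3) flop count
    double comp  = work / (P * s.flopsPerProc);
    double msgs  = 2.0 * P;                          // assumed exchange pattern
    double bytes = 8.0 * N * N / P;                  // assumed message size
    double comm  = msgs * (s.setupSec + bytes / s.bytesPerSec);
    return comp + comm;
}

int main() {
    System paragonLike{50e6, 80e-6, 70e6};           // illustrative constants
    for (int P : {1, 4, 16, 64})
        std::cout << "P=" << P << "  T=" << predictSeconds(128, P, paragonLike) << " s\n";
}
```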

18.
Heterogeneous (or hybrid) computing platforms with Intel Xeon Phi accelerators offer potential advantages of energy efficient, massively parallel computing, while supporting parallel programming models familiar to users of multicore CPUs. However, realizing this potential for real-world applications still remains a challenging issue. The main goal of this paper is the suitability assessment of offload-based programming environments for porting a real-life scientific application to hybrid platforms with Intel KNC and KNL accelerators, assuming no significant modifications of the application code. The main criterion of this assessment is the application performance. The evaluated environments include: 1) Intel Offload coupled with OpenMP, 2) OpenMP 4.0 and 3) OpenMP 4.5 Accelerator Models, and 4) hStreams Library with OpenMP. A real-life application dedicated to the numerical modeling of alloy solidification is used as a testbed in the assessment. An experimental evaluation of the four versions of the application code for a platform with KNC coprocessors shows that, excluding OpenMP 4.0, the rest of them are able to adapt to expansion of available resources, however, with different efficiency. While the shortest execution time is achieved for Intel Offload, the high-level abstractions of hStreams contribute considerably to making porting and tuning the application easier, with low performance overheads in comparison to the low-level Intel Offload extension. Benchmarking the application performance and scalability on a platform with multiple KNL processors, using the Offload over Fabric technology with Intel Offload and OpenMP 4.5, concludes the assessment.
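For context, the sketch below shows what the OpenMP 4.5 Accelerator Model offload style evaluated above looks like for a trivial vector update: the `target` region is mapped to a coprocessor when one is present and falls back to the host otherwise. It is an illustrative kernel, not the alloy-solidification application from the paper.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double* pa = a.data();
    double* pb = b.data();
    // OpenMP 4.5-style offload: map the arrays to the device, run the loop
    // across teams/threads there, copy the result back.
    #pragma omp target teams distribute parallel for \
            map(tofrom: pa[0:n]) map(to: pb[0:n])
    for (int i = 0; i < n; ++i)
        pa[i] += 0.5 * pb[i];
    std::printf("a[0] = %f\n", pa[0]);   // 2.0
}
```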

19.
In this paper, we present a fine-grained parallel implementation of the MPEG-2 video encoder on the Intel Paragon XP/S parallel computer. We use a data-parallel approach and exploit parallelism within each frame, unlike some of the previous approaches that employ multiple processing of several disjoint video sequences. This makes our encoder suitable for real-time applications where the complete video sequence may not be present on disk and may become available on a frame-by-frame basis with time. The Express parallel programming environment is employed as the underlying message-passing system, making our encoder portable across a wide range of parallel and distributed architectures. The encoder also provides control over various parameters such as the number of processors in each dimension, the size of the motion search window, buffer management, and bitrate. Moreover, it has the flexibility to allow the inclusion of fast new algorithms for different stages of the codec into the program, replacing current algorithms. Comparisons of execution times, speedups, and frame encoding rates using different numbers of processors are provided. An analysis of frame data distribution among multiple processors is also presented. In addition, our study reveals the degrees of parallelism and bottlenecks in the various computational modules of the MPEG-2 algorithm. We have used two motion estimation techniques and five different video sequences for our experiments. Using maximum parallelism by dividing one block per processor, an encoding rate higher than 30 frames/s has been achieved.
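Within-frame data parallelism of the kind described means splitting one frame's macroblocks across processors. The sketch below distributes strips of macroblock rows with OpenMP and charges each strip a stand-in cost; the encoder itself uses message passing on the Paragon, and the frame size, macroblock size, and workload here are all illustrative.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int W = 704, H = 480, MB = 16;            // frame in pixels, macroblock size
    std::vector<uint8_t> frame(W * H, 128);         // dummy luminance plane
    std::vector<long> rowCost(H / MB, 0);
    // One strip of macroblock rows per task: strips are independent for this
    // stage, so they can be processed in parallel within a single frame.
    #pragma omp parallel for
    for (int mbRow = 0; mbRow < H / MB; ++mbRow) {
        long sad = 0;                               // stand-in for motion-search work
        for (int y = mbRow * MB; y < (mbRow + 1) * MB; ++y)
            for (int x = 0; x < W; ++x)
                sad += frame[y * W + x];
        rowCost[mbRow] = sad;
    }
    std::printf("first strip cost: %ld\n", rowCost[0]);
}
```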

20.
Rsim: simulating shared-memory multiprocessors with ILP processors
The early 1990s saw several announcements of commercial shared-memory systems using processors that aggressively exploited instruction-level parallelism (ILP), including the MIPS R10000, Hewlett-Packard PA8000, and Intel Pentium Pro. These processors could potentially reduce memory read stalls by overlapping read latency with other operations, possibly changing the nature of performance bottlenecks in the system. The authors' experience with Rsim demonstrates that modeling ILP features is important even in shared-memory multiprocessor systems. In particular, current simple processor-based approximations cannot model significant performance effects for applications exhibiting parallel read misses. Further, recent shared-memory designs such as aggressive implementations of sequential consistency use the aggressive ILP-enhancing features of modern processors that simple processor-based simulators do not model. As microprocessor systems become more complex, the availability of shared infrastructure source code is likely to become increasingly crucial. The authors plan to release a new Rsim version shortly that will include instruction caches, TLBs, multimedia extensions, simultaneous multithreading, a Rabbit fast simulation mode, and ports to Linux platforms.
