20 similar documents found; search time: 0 ms
1.
The computation of eigenvalues and eigenvectors of symmetric tridiagonal matrices arises frequently in applications, often as one of the steps in the solution of Hermitian and symmetric eigenproblems. While several accurate and efficient methods for the tridiagonal eigenproblem exist, their corresponding implementations usually target uni-processors or large distributed-memory systems. Our new eigensolver MR3-SMP is instead specifically designed for multi-core and many-core general-purpose processors, which today have effectively replaced uni-processors. We show that in most cases MR3-SMP is faster and achieves better speedups than state-of-the-art eigensolvers for uni-processors and distributed-memory systems.
2.
Although designed as a cross-platform parallel programming model, OpenCL remains mainly used for GPU programming. Nevertheless, a large number of applications are parallelized, implemented, and eventually optimized in OpenCL. Thus, in this paper, we focus on the potential these parallel applications have to exploit the performance of multi-core CPUs. Specifically, we analyze how to systematically reuse and adapt OpenCL code from GPUs to CPUs. We claim that this work is a necessary step toward enabling inter-platform performance portability in OpenCL.
3.
Jason Jianxun Ding Abdul Waheed Jingnan Yao Laxmi N. Bhuyan 《Journal of Parallel and Distributed Computing》2010
There is a growing trend to insert application intelligence into network devices. Processors in this type of Application-Oriented Networking (AON) device are required to handle both packet-level, network-I/O-intensive operations and XML message-level, CPU-intensive operations. In this paper, we investigate the performance effect of symmetric multi-processing (SMP) via (1) hardware multi-threading, (2) uni-processor to dual-processor architectures, and (3) single- to dual- and quad-core processing, on both packet-level and XML message-level traffic. We use AON systems based on Intel Xeon processors with hyper-threading, Pentium M based dual-core processors, and Intel's dual quad-core Xeon E5335 processors. We analyze and cross-examine the SMP effect from both high-level performance and processor micro-architectural perspectives. The evaluation results not only provide insight to microprocessor designers, but also help system architects of AON-type devices select the right processors.
4.
While the majority of CPUs now sold contain multiple computing cores, current grid computing systems either ignore the multiplicity of cores, or treat them as distinct, independent machines. The latter approach ignores the resource contention present between cores in a single CPU, while the former approach fails to take advantage of significant computing power. We provide a decentralized resource management framework for exploiting multi-core nodes to run multi-threaded applications in peer-to-peer grids. We present two new load-balancing schemes that explicitly account for the resource sharing and contention of multiple cores, and propose a parameterized performance prediction model that can represent a continuum of resource sharing among cores of a CPU. We use extensive simulation to confirm that our two algorithms match jobs with computing nodes efficiently, and balance load during the lifetime of the computing jobs.
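The abstract does not give the paper's actual prediction model. As a hypothetical illustration of the idea of a "continuum of resource sharing", a single parameter can interpolate between fully independent cores and fully contended ones (the function name and formula below are assumptions, not the authors' model):

```python
def predicted_runtime(work, cores, sharing=0.0):
    """Predict runtime of a multi-threaded job on `cores` cores.

    `sharing` in [0, 1] interpolates between fully independent cores
    (0.0: perfect speedup) and fully shared resources (1.0: no benefit
    from extra cores). Illustrative stand-in, not the paper's model.
    """
    if cores < 1:
        raise ValueError("need at least one core")
    # Effective parallelism shrinks as cores contend for shared resources.
    effective_cores = 1 + (cores - 1) * (1 - sharing)
    return work / effective_cores

independent = predicted_runtime(100.0, 4, sharing=0.0)  # linear speedup
contended = predicted_runtime(100.0, 4, sharing=1.0)    # no speedup
```

A load balancer could then rank candidate nodes by `predicted_runtime` instead of by raw core count.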
5.
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software-controlled data forwarding technique that sends data to the destination's cache before it is needed, eliminating cache misses in the destination's cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architectural optimizations to multi-core processors.
6.
Data warehouse workloads are crucial for the support of on-line analytical processing (OLAP). The strategy to cope with OLAP queries on such huge amounts of data calls for the use of large parallel computers. The trend today is to use cluster architectures that show a reasonable balance between cost and performance. In such cases, it is necessary to tune the applications in order to minimize the amount of I/O and communication, such that the global execution time is reduced as much as possible.
In this paper, we model and analyze the most up-to-date strategies for ad hoc star join query processing in a cluster of computers. We show that, for ad hoc query processing and assuming a limited amount of resources available, these strategies still have room for improvement both in terms of I/O and inter-node data traffic communication. Our analysis concludes with the proposal of a hybrid solution that improves these two aspects compared to the previous techniques, and shows near optimal results in a broad spectrum of cases.
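The hybrid strategy itself is not detailed in the abstract. The baseline being optimized is the classic star join, which can be sketched as hashing the (small) dimension tables in memory and streaming the fact table past them; all names and data below are illustrative:

```python
def star_join(fact_rows, dimensions):
    """Inner-join a fact table against several dimension tables.

    `fact_rows`: iterable of dicts holding foreign keys.
    `dimensions`: {fk_column: {key: attribute_dict}} hash tables,
    assumed small enough to fit in memory (typical for star schemas).
    """
    for row in fact_rows:
        out = dict(row)
        matched = True
        for fk, table in dimensions.items():
            attrs = table.get(row[fk])
            if attrs is None:          # no match: drop the row (inner join)
                matched = False
                break
            out.update(attrs)
        if matched:
            yield out

facts = [{"prod_id": 1, "store_id": 10, "amount": 5},
         {"prod_id": 2, "store_id": 99, "amount": 3}]
dims = {"prod_id": {1: {"product": "ale"}, 2: {"product": "stout"}},
        "store_id": {10: {"city": "Bonn"}}}
result = list(star_join(facts, dims))   # second row has no matching store
```

In a cluster setting the interesting question, and the focus of the paper, is how to partition and replicate these tables so that I/O and inter-node traffic stay low.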
7.
We have designed Particle-in-Cell algorithms for emerging architectures. These algorithms share a common approach, using fine-grained tiles, but different implementations depending on the architecture. On the GPU, there were two different implementations, one with atomic operations and one with no data collisions, using CUDA C and Fortran. Speedups of up to about 50 compared to a single core of the Intel i7 processor have been achieved. There was also an implementation for traditional multi-core processors using OpenMP which achieved high parallel efficiency. We believe that this approach should also work for other emerging designs such as the Intel Phi coprocessor from the Intel MIC architecture.
8.
Hardware Transactional Memory (HTM) is an attractive design concept which simplifies parallel programming by shifting the problem of correct synchronization between threads to the underlying hardware memory system.
9.
Jonas Larsson 《Mathematics and computers in simulation》2010,81(3):578-587
The processor evolution has reached a critical moment in time where it will soon be impossible to increase the frequency much further. Processor designers such as Motorola, Intel and IBM have all realised that the only way to improve the FLOP/Watt ratio is to develop multi-core devices. One of the most current examples of multi-core processors is the new Sony/Toshiba/IBM Cell/B.E. multi-core processor. Monte Carlo methods are often considered embarrassingly parallel and are therefore well suited to parallel execution. This paper describes how a common Monte Carlo based financial simulation can be calculated in parallel using the Cell/B.E. multi-core processor. The measured performance with the achieved multi-core speed-up is also presented. With this now widely available technology, financial simulations can be performed in a fraction of the time they used to take, within a limited power and volume budget and using commercially available hardware. The main challenge with multi-core devices is clearly the programmability, and the work presented here describes how this challenge can be dealt with. A basic MPI library has been developed to handle the partitioning and communication of data. The thread creation follows a POSIX thread creation model. MPI together with POSIX makes the application portable between various multi-processor systems and multi-core devices. The conclusions indicate that a function-offload MPI implementation on the Cell/B.E. multi-core processor can efficiently speed up the Monte Carlo solution of financial simulations. The conclusions made herein are also applicable to other situations where an algorithm can be easily parallelized.
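The embarrassingly parallel pattern described above can be sketched in a few lines: give each worker its own seed and an independent share of the paths, then average the partial results. This is a hypothetical stand-in (Python threads instead of Cell/B.E. SPE workers, and an assumed European-call payoff), not the paper's code:

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def mc_call_price(seed, n, s0=100.0, k=100.0, r=0.05, sigma=0.2, t=1.0):
    """Price a European call by Monte Carlo over n GBM paths
    (one worker's share). Assumed payoff model, for illustration."""
    rng = random.Random(seed)        # private RNG: workers are independent
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n):
        s_t = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(s_t - k, 0.0)
    return math.exp(-r * t) * total / n

def parallel_price(workers=4, paths_per_worker=50_000):
    # Each worker simulates its own partition; results are averaged.
    # Threads illustrate the partitioning only: CPython's GIL means real
    # speedup needs processes or, as in the paper, MPI-style workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(mc_call_price,
                              range(workers),
                              [paths_per_worker] * workers))
    return sum(parts) / len(parts)

price = parallel_price()
```

With these parameters the estimate lands near the Black-Scholes value of roughly 10.45.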
10.
Jiayin Li Zhong Ming Meikang Qiu Gang Quan Xiao Qin Tianzhou Chen 《Journal of Systems Architecture》2011,57(9):840-849
Multi-core technologies are widely used in embedded systems, and resource allocation is vital to guaranteeing Quality of Service (QoS) requirements for applications on multi-core platforms. For heterogeneous multi-core systems, the statistical characteristics of execution times on different cores play a critical role in resource allocation, and differences between actual and estimated execution times may significantly affect the performance of resource allocation and make the system less robust. In this paper, we present an evaluation method to study the impact of inaccurate execution time information on the performance of resource allocation. We propose a systematic way to measure the robustness degradation of the system and evaluate how inaccurate probability parameters may affect the performance of resource allocations. Furthermore, we use simulations to compare the performance of three widely used greedy heuristics when they operate on inaccurate information.
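The abstract does not name the three heuristics. One common greedy baseline, assign each task to the core with the earliest estimated finish time, is enough to show how a single bad estimate degrades the realized makespan (all names and numbers below are illustrative):

```python
def greedy_assign(estimates, n_cores):
    """Assign each task to the core with the lowest estimated load.
    Returns the chosen core index for each task, in order."""
    loads = [0.0] * n_cores
    assignment = []
    for est in estimates:
        core = min(range(n_cores), key=loads.__getitem__)
        assignment.append(core)
        loads[core] += est
    return assignment

def makespan(assignment, times, n_cores):
    """Completion time of the slowest core under the given times."""
    loads = [0.0] * n_cores
    for core, t in zip(assignment, times):
        loads[core] += t
    return max(loads)

estimates = [4.0, 3.0, 2.0, 2.0, 1.0]
actuals   = [8.0, 3.0, 2.0, 2.0, 1.0]   # first estimate was badly wrong
plan = greedy_assign(estimates, 2)
degradation = makespan(plan, actuals, 2) / makespan(plan, estimates, 2)
```

Here the plan looks perfectly balanced on paper (makespan 6.0) but the mis-estimated task stretches one core to 10.0, a 67% degradation, which is the kind of robustness loss the paper quantifies.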
11.
Contemporary operating systems for single-ISA (instruction set architecture) multi-core systems attempt to distribute tasks equally among all the CPUs. This approach works relatively well when there is no difference in CPU capability. However, there are cases in which CPU capability differs from one core to another. For instance, static capability asymmetry results from the advent of new asymmetric hardware, and dynamic capability asymmetry comes from operating system (OS) noise caused by networking or I/O handling. These asymmetries can make it hard for the OS scheduler to evenly distribute the tasks, resulting in less efficient load balancing. In this paper, we propose a user-level load balancer for parallel applications, called the 'capability balancer', which recognizes differences in CPU capability and makes subtasks share the entire CPU capability fairly. The balancer can coexist with the existing kernel-level load balancer without degrading the kernel balancer's behavior. The capability balancer can fairly distribute CPU capability to tasks with very little overhead. For real workloads like the NAS Parallel Benchmark (NPB), we have accomplished speedups of up to 9.8% and 8.5% under dynamic and static asymmetry, respectively. We have also observed speedups of 13.3% for dynamic asymmetry and 24.1% for static asymmetry in a competitive environment. The impacts of our task selection policies, FIFO (first in, first out) and cache, were compared. The cache policy led to a speedup of 5.3% in overall execution time and a decrease of 4.7% in the overall cache miss count, compared with the FIFO policy, which is used by default.
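The core idea, giving each core work in proportion to its measured capability rather than an equal share, can be sketched as follows. This mirrors the concept only; the balancer's actual algorithm is not in the abstract, and the function below is an assumption:

```python
def capability_shares(total_work, capabilities):
    """Split `total_work` units across cores in proportion to capability.

    `capabilities` are relative speeds (e.g. measured throughput).
    Illustrative sketch of capability-aware balancing, not the paper's
    algorithm.
    """
    cap_sum = sum(capabilities)
    shares = [int(total_work * c / cap_sum) for c in capabilities]
    # Hand out any rounding remainder, one unit each, to the fastest cores.
    remainder = total_work - sum(shares)
    fastest_first = sorted(range(len(shares)),
                           key=lambda i: capabilities[i], reverse=True)
    for i in fastest_first[:remainder]:
        shares[i] += 1
    return shares

# A core slowed to half speed by OS noise receives half the work.
shares = capability_shares(90, [1.0, 1.0, 0.5])
```

An equal split would leave the slow core as a straggler; the proportional split lets all cores finish at roughly the same time.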
12.
Multicomputers for massively parallel processing will eventually employ billions of processing elements, each of which will be capable of communicating with every other processing element. A knowledge-based modelling and simulation environment (KBMSE) for investigating such multicomputer architectures at a discrete-event system level is described. The KBMSE implements the discrete-event system specification (DEVS) formalism in an object-oriented programming system of Scheme (a dialect), which supports building models in a hierarchical, modular manner, a systems-oriented approach not possible in conventional simulation languages. The paper presents a framework for knowledge-based modelling and simulation by exemplifying the modelling of a hypercube multicomputer architecture in the KBMSE. The KBMSE has been tested on a variety of domains characterized by complex, hierarchical structures such as advanced multicomputer architectures, local area computer networks, intelligent multi-robot organizations, and biologically based life-support systems.
13.
In this paper, a simulation framework that enables distributed numerical computing in multi-core shared-memory environments is presented. Using multiple threads allows a single memory image to be shared concurrently across cores but potentially introduces race conditions. Race conditions can be avoided by ensuring each core operates on an isolated memory block. This is usually achieved by running a different operating system process on each core, such as multiple MPI processes. However, we show that in many computational physics problems, memory isolation can also be enforced within a single process by leveraging spatial sub-division of the physical domain. A new spatial sub-division algorithm is presented that ensures threads operate on different memory blocks, allowing for in-place updates of state, with no message passing or creation of local variables during time stepping. Additionally, the developed framework distributes tasks dynamically, ensuring an event-based load balance. Results from fluid mechanics analysis using Smoothed Particle Hydrodynamics (SPH) are presented, demonstrating linear performance scaling with the number of cores.
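The memory-isolation idea can be sketched minimally on a 1-D grid: each thread writes only to its own contiguous block of the output, so no locks are needed. This sketch uses double buffering for simplicity (the paper's algorithm achieves in-place updates, which this does not show), and all names are illustrative:

```python
from threading import Thread

def relax_block(grid, new, lo, hi):
    """Smooth cells [lo, hi) from the previous state in `grid`.

    Each thread writes only to its own slice of `new`, so no locks are
    needed; `grid` holds the read-only previous time step.
    """
    n = len(grid)
    for i in range(lo, hi):
        left = grid[i - 1] if i > 0 else grid[i]
        right = grid[i + 1] if i < n - 1 else grid[i]
        new[i] = (left + grid[i] + right) / 3.0

def step(grid, n_threads=4):
    """One time step: spatially sub-divide the domain across threads."""
    new = [0.0] * len(grid)
    block = (len(grid) + n_threads - 1) // n_threads
    threads = [Thread(target=relax_block,
                      args=(grid, new, t * block,
                            min((t + 1) * block, len(grid))))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return new

smoothed = step([0.0, 0.0, 3.0, 0.0, 0.0])
```

Because the write sets of the blocks are disjoint, the result is deterministic regardless of thread scheduling, which is exactly the property the sub-division algorithm guarantees.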
14.
15.
Alberto Núñez Javier Fernández Rosa Filgueira Félix García Jesús Carretero 《Simulation Modelling Practice and Theory》2012,20(1):12-32
In this paper we propose a new simulation platform called SIMCAN for analyzing parallel and distributed systems. This platform is aimed at testing parallel and distributed architectures and applications. The main characteristics of SIMCAN are flexibility, accuracy, performance, and scalability. Hence, the proposed platform has a modular design that eases the integration of different basic systems on a single architecture. Its design follows a hierarchical schema that includes simple modules, basic systems (computing, memory managing, I/O, and networking), physical components (nodes, switches, …), and aggregations of components. New modules may also be incorporated to include new strategies and components. A graphical configuration tool has also been developed to help untrained users with the task of modelling new architectures. Finally, a validation process and some evaluation tests have been performed to evaluate the SIMCAN platform.
16.
Seetharami Seelam Liana Fong Asser Tantawi John Lewars John Divirgilio Kevin Gildea 《Journal of Parallel and Distributed Computing》2013
System noise, or jitter, is the activity of hardware, firmware, operating system, runtime system, and management software events. It has been shown to disproportionately impact application performance in current-generation large-scale clustered systems running general-purpose operating systems (GPOS). Jitter mitigation techniques, such as co-scheduling jitter events across operating systems, improve application performance, but their effectiveness on future petascale systems is unknown. To understand whether existing jitter mitigation solutions enable scalable petascale performance, we construct two complementary jitter models based on detailed analysis of system noise from the nodes of a large-scale system running a GPOS. We validate these two models using experimental data from a system consisting of 256 GPOS instances with 8192 CPUs. Based on our models, we project a minimum slowdown of 1.8%, 4.1%, and 6.5% for applications executing on a similar one-petaflop system running 1024 GPOS instances and having global synchronization operations once every 100 ms, 10 ms, and 1 ms, respectively. Our projections indicate that, although existing mitigation solutions enable scalable petascale performance, additional techniques are required to contain the impact of jitter on multi-petaflop systems, especially for tightly synchronized applications.
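The paper's two validated models are not reproduced in the abstract. A toy barrier-synchronization model is enough to convey why slowdown grows with node count: every interval, the slowest node sets the pace, and with many nodes some node is almost always delayed by noise. All parameters and names below are assumptions, not the paper's models:

```python
import random

def synchronized_runtime(n_nodes, n_intervals, interval,
                         noise_prob, noise_len, seed=0):
    """Toy jitter model: all nodes must reach a barrier each interval;
    the slowest node (compute time plus any noise event) sets the pace.
    Illustrative only -- not the paper's validated models."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_intervals):
        slowest = max(
            interval + (noise_len if rng.random() < noise_prob else 0.0)
            for _ in range(n_nodes))
        total += slowest
    return total

base = 1000 * 0.010   # 1000 intervals of 10 ms each, jitter-free
# 1024 nodes, 1% chance per node per interval of a 2 ms noise event.
noisy = synchronized_runtime(1024, 1000, 0.010, 0.01, 0.002)
slowdown = noisy / base - 1.0
```

With 1024 nodes, the probability that at least one node is hit in a given interval is near 1, so nearly every interval pays the full 2 ms penalty; a single node would pay it only about 1% of the time. This amplification with scale is why tight synchronization periods (1 ms vs. 100 ms) hurt so much in the paper's projections.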
17.
A Windows-Based Integrated Network Performance Analysis System
Network performance analysis is a key part of network design and network management. This system implements an integrated network performance analysis tool with functions for collecting, processing, analyzing, storing, and displaying data. It can analyze network topologies, simulation systems, and real networks. The system runs on the Windows platform and presents results as charts, facilitating network design and management.
18.
Wessam M. Hassanein Layali K. Rashid Moustafa A. Hammad 《International journal of parallel programming》2008,36(2):206-225
As information processing applications take greater roles in our everyday life, database management systems (DBMSs) are growing in importance. DBMSs have traditionally exhibited poor cache performance and large memory footprints, therefore performing only at a fraction of their ideal execution and exhibiting low processor utilization. Previous research has studied the memory system of DBMSs on research-based simultaneous multithreading (SMT) processors. Recently, several differences have been noted between the real hyper-threaded architecture implemented by the Intel Pentium 4 and the earlier SMT research architectures. This paper characterizes the performance of a prototype open-source DBMS running TPC-equivalent benchmark queries on an Intel Pentium 4 Hyper-Threading processor. We use hardware counters provided by the Pentium 4 to evaluate the micro-architecture and study the memory system behavior of each query running on the DBMS. Our results show a performance improvement of up to 1.16 in TPC-C-equivalent and 1.26 in TPC-H-equivalent queries due to hyper-threading.
19.
Executing traditional Message Passing Interface (MPI) applications on multi-core clusters while balancing speed and computational efficiency is a difficult task that parallel programmers have to deal with. For this reason, communications on multi-core clusters ought to be handled carefully in order to improve performance metrics such as efficiency, speedup, execution time and scalability. In this paper we focus our attention on SPMD (Single Program Multiple Data) applications with high communication volume and synchronicity, and with static, local, and regular characteristics. This work proposes a method for SPMD applications that manages the communication heterogeneity (different cache levels, RAM memory, network, etc.) of homogeneous multi-core computing platforms in order to improve application efficiency. In this sense, the main objective of this work is to find analytically the ideal number of cores that allows us to obtain the maximum speedup while computational efficiency is maintained over a defined threshold (strong scalability). The method also allows us to determine how the problem size must be increased in order to keep the execution time constant while the number of cores is expanded (weak scalability), considering the trade-off between speed and efficiency. The methodology has been tested with different benchmarks and applications, and we achieved an average efficiency improvement of around 30.35% in the applications tested, using different problem sizes and multi-core clusters. In addition, results show that the maximum speedup with a defined efficiency is located close to the values calculated with our analytical model, with an error rate lower than 5% for the applications tested.
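The paper's analytic model is not given in the abstract. As a toy stand-in, suppose computation divides across cores while communication cost grows linearly with them; the "ideal" core count is then the largest one whose efficiency stays above the threshold (the cost model and all numbers below are assumptions):

```python
def exec_time(n, compute=1000.0, comm=1.0):
    """Toy SPMD cost model: computation splits across n cores while
    communication grows with n. Not the paper's model."""
    return compute / n + comm * n

def ideal_cores(threshold=0.85, max_cores=256):
    """Largest core count whose parallel efficiency (speedup / n)
    stays at or above `threshold` (the strong-scalability criterion)."""
    t1 = exec_time(1)
    best = 1
    for n in range(1, max_cores + 1):
        efficiency = (t1 / exec_time(n)) / n
        if efficiency >= threshold:
            best = n
    return best

n_star = ideal_cores(threshold=0.85)   # max cores at >= 85% efficiency
```

Under this model, efficiency falls monotonically as communication overwhelms the shrinking per-core compute, so relaxing the threshold admits more cores, the speed-versus-efficiency trade-off the paper formalizes.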
20.
With the continuous development of communication technology and the ever-growing scale of communication networks, the "police network" has become a trend for meeting the demands of today's complex social environment. Using the OPNET simulation software, this paper systematically studies and verifies the feasibility of building such a "police network", and analyzes the network's performance based on the simulation results.