Similar Documents
20 similar documents found (search time: 62 ms).
1.
This paper presents a helper thread prefetching scheme designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have the advantage that resources such as the processor and the L1 cache are not contended for by the application and helper threads, preserving the speed of the application. However, interprocessor communication is expensive in such a system, and we present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead of the application thread the helper thread may execute, which we found is important for ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system in which a simple processor in memory executes the helper thread. Evaluated with nine memory-intensive applications, our scheme with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, the scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. On a real CMP system with an L2 cache shared between two cores, our helper thread prefetching combined with hardware L2 prefetching achieves an average speedup of 1.15 over hardware L2 prefetching alone for the subset of applications with high L2 cache misses per cycle.
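The abstract's key mechanism is bounding how far the helper thread may run ahead of the application. Below is a minimal C++ sketch of that idea — not the paper's actual implementation; the `MAX_AHEAD` parameter, the shared counter, and the array workload are all hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical sketch: a helper thread prefetches array elements a bounded
// distance ahead of the application thread. MAX_AHEAD caps the run-ahead so
// prefetches arrive in time but do not pollute the cache.
constexpr long MAX_AHEAD = 64;   // assumed tuning parameter
constexpr long N = 1 << 20;

std::atomic<long> app_pos{0};    // progress of the application thread

void helper(const std::vector<double>& data) {
    for (long i = 0; i < N; ++i) {
        // Throttle: never run more than MAX_AHEAD elements ahead.
        while (i - app_pos.load(std::memory_order_acquire) > MAX_AHEAD)
            std::this_thread::yield();
        __builtin_prefetch(&data[i]);        // GCC/Clang builtin
    }
}

double app(const std::vector<double>& data) {
    double sum = 0.0;
    for (long i = 0; i < N; ++i) {
        sum += data[i];                      // ideally hits in cache now
        app_pos.store(i, std::memory_order_release);
    }
    return sum;
}

int main() {
    std::vector<double> data(N, 1.0);
    std::thread h(helper, std::cref(data));
    double s = app(data);
    h.join();
    std::printf("sum = %f\n", s);
}
```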

2.
Implementations of relational operators on GPUs have achieved order-of-magnitude speedups over their multicore CPU counterparts. Here we focus on the efficient implementation of the string matching operators common in SQL queries. Because of differing architectural features, the optimal algorithm for CPUs may be suboptimal for GPUs. GPUs achieve high memory bandwidth by running thousands of threads, so in a naive implementation it is not feasible to keep the working set of all threads in the cache. Moreover, the unit of execution on a GPU is a group of threads; in the presence of loops and branches, threads in a group must follow the same execution path, and if some threads diverge, the different paths are serialized. We study the cache efficiency of single- and multi-pattern string matching algorithms for conventional and pivoted string layouts in GPU memory. We evaluate memory efficiency in terms of access pattern and achieved memory bandwidth for different parallelization methods. To reduce thread divergence, we split string matching into multiple steps. We evaluate the matching algorithms in terms of average- and worst-case performance and compare them against state-of-the-art CPU and GPU libraries. Our experimental evaluation shows that thread and memory efficiency affect performance significantly and that our proposed methods outperform previous CPU and GPU algorithms in both raw performance and power efficiency. The Knuth–Morris–Pratt algorithm is a good choice for GPUs because its regular memory access pattern makes it amenable to several GPU optimizations.
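Since the abstract singles out Knuth–Morris–Pratt, here is the standard sequential algorithm in C++ for reference. This is the textbook CPU form, not the paper's GPU version; on a GPU each thread would run this scan over its own chunk of the text, which works precisely because the text index only moves forward.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Standard Knuth–Morris–Pratt matcher. Its appeal on GPUs, per the abstract,
// is the regular access pattern: the text pointer never backtracks.
std::vector<int> build_failure(const std::string& pat) {
    std::vector<int> fail(pat.size(), 0);
    for (size_t i = 1, k = 0; i < pat.size(); ++i) {
        while (k > 0 && pat[i] != pat[k]) k = fail[k - 1];
        if (pat[i] == pat[k]) ++k;
        fail[i] = k;
    }
    return fail;
}

std::vector<size_t> kmp_search(const std::string& text, const std::string& pat) {
    std::vector<size_t> hits;
    std::vector<int> fail = build_failure(pat);
    for (size_t i = 0, k = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pat[k]) k = fail[k - 1];
        if (text[i] == pat[k]) ++k;
        if (k == pat.size()) {               // full match ending at i
            hits.push_back(i + 1 - pat.size());
            k = fail[k - 1];
        }
    }
    return hits;
}

int main() {
    for (size_t pos : kmp_search("abracadabra", "abra"))
        std::cout << "match at " << pos << '\n';   // prints 0 and 7
}
```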

3.
Hybrid CPU/GPU clusters have recently drawn much attention in high-performance computing because of their excellent execution performance and energy efficiency: many supercomputing sites in the latest TOP500 and Green500 lists are built from hybrid CPU/GPU clusters rather than CPU-only clusters. However, the programming complexity of hybrid CPU/GPU clusters is so high that most users hesitate to move to this new platform. To address this problem, we propose BigGPU, a distributed PTX virtual machine for heterogeneous clusters. As its name suggests, this virtual machine is physically a distributed system that recompiles and executes PTX code in parallel by aggregating the CPUs and GPUs available in a computational cluster. With its support, users can regard a hybrid CPU/GPU cluster as a single large-scale GPU: they can develop applications using only CUDA, without combining MPI and multithreading APIs, while still using the distributed CPUs and GPUs to solve the same problem. Moreover, because BigGPU supports a large-scale virtual device memory space and thread configuration, users need not handle load balancing among heterogeneous processors or the device-memory and thread-configuration constraints of physical GPUs. We also evaluate the execution performance of BigGPU; our experimental results show that it can effectively exploit the computational power of CPUs and GPUs to enhance the execution performance of users' CUDA programs.

4.
赖海光, 黄皓, 谢俊元. 《计算机应用》 (Journal of Computer Applications), 2005, 25(5): 1141-1144
Network intrusion detection systems (NIDS) detect attacks by capturing and analyzing network packets. As network bandwidth keeps increasing, it is increasingly difficult for NIDS processing capacity to keep up with network speed. This paper proposes a method that uses symmetric multiprocessors (SMP) to improve NIDS processing capacity: multiple CPUs process network packets in parallel to improve system performance. Based on an analysis of the NIDS processing pipeline, an efficient parallel processing structure is designed that allows threads running on different CPUs to execute with a high degree of parallelism. In addition, the proposed thread synchronization scheme guarantees functional correctness while avoiding mutually exclusive access to shared resources, further increasing thread parallelism. Experiments show that an NIDS implemented on a dual-CPU SMP achieves 80% higher performance than a single-CPU system.
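The abstract does not spell out how synchronization avoids mutual exclusion; a common way to achieve this is a single-producer/single-consumer ring per worker, where the capture thread and each analysis thread coordinate through two atomic counters only. The sketch below is a hypothetical illustration of that pattern, not the paper's design; the `Packet` layout and ring size are assumptions.

```cpp
#include <array>
#include <atomic>
#include <cstdint>

// Hypothetical sketch of mutex-free packet hand-off: one capture thread
// dispatches packets to per-worker single-producer / single-consumer rings,
// so no lock guards the shared queue.
struct Packet { uint32_t len; uint8_t data[1500]; };

template <size_t N>                      // N must be a power of two
class SpscRing {
    std::array<Packet*, N> slots_{};
    std::atomic<size_t> head_{0};        // advanced only by the producer
    std::atomic<size_t> tail_{0};        // advanced only by the consumer
public:
    bool push(Packet* p) {               // capture thread (producer)
        size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N) return false; // full
        slots_[h % N] = p;
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
    Packet* pop() {                      // analysis thread (consumer)
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire)) return nullptr;   // empty
        Packet* p = slots_[t % N];
        tail_.store(t + 1, std::memory_order_release);
        return p;
    }
};
```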

5.
We present a model checking-based method for verifying list-based concurrent set data structures. Concurrent data structures are notorious for being hard to get right, and thus their verification has received significant attention from the verification community. These data structures are unbounded in two dimensions: the list size is unbounded, and an unbounded number of threads access them. Thus, their model checking requires abstraction to a model bounded in both dimensions. We first show how the unbounded number of threads can be model checked by reduction to a finite model, while assuming a bounded number of list nodes. In particular, we leverage the CMP (CoMPositional) method, which abstracts the unbounded threads by keeping one thread as is (concrete) and abstracting all the other threads into a single environment thread. The method then proceeds as a series of iterations in which the abstraction is model checked and, if a spurious counterexample is observed, refined. Refinement is performed by the user, who inspects the returned counterexamples; if the user determines a counterexample to be spurious, she adds constraints to the abstraction in the form of lemmas. Upon addition, these lemmas are themselves verified for correctness as part of the CMP method. Since the lemmas are verified in any case, we show how some of the lemmas required for refining the abstract model can be mined automatically using the assertion mining tool Daikon. Next, we show how the CMP approach can be extended to the list dimension as well, to fully verify the data structures in both dimensions. While it is possible to show correctness of these data structures for an unbounded number of threads by keeping one concrete thread and abstracting the others, this is not directly possible in the list dimension, because the nodes pointed to by the threads change during list traversal. Our method addresses this challenge by constructing an abstraction in which the concrete nodes, i.e., the nodes pointed to by the threads, can change as the thread pointers move during program execution. Further, our method allows refinement of this abstraction to prove properties of interest. We show the soundness of our method and establish its utility by model checking challenging concurrent list-based data structure examples.

6.
Noise and radiation-induced soft errors (transient faults) in computer systems have increased significantly over the last few years and are expected to increase even more as we move toward smaller transistor sizes and lower supply voltages. Fault detection and recovery can be achieved through redundancy. The emergence of chip multiprocessors (CMPs) makes it possible to execute redundant threads on a chip and provide relatively low-cost reliability. State-of-the-art implementations execute two copies of the same program as two threads (redundant multithreading), either on the same or on separate processor cores in a CMP, and periodically check results. Although this solution has favorable performance and reliability properties, every redundant instruction flows through a high-frequency complex out-of-order pipeline, thereby incurring a high power consumption penalty. This paper proposes mechanisms that attempt to provide reliability at a modest power and complexity cost. When executing a redundant thread, the trailing thread benefits from the information produced by the leading thread. We take advantage of this property and comprehensively study different strategies to reduce the power overhead of the trailing core in a CMP. These strategies include dynamic frequency scaling, in-order execution, and parallelization of the trailing thread.

7.
Model checking of safety properties can be scaled up by pooling the CPU and memory resources of multiple computers. As compute clusters containing hundreds of nodes, each built from multi-core (e.g., dual-CPU) hardware, become widespread, a model checker that combines the parallel (shared-memory) and distributed (message-passing) paradigms will use the hardware resources more efficiently. Such a model checker can be designed by having each node employ two shared-memory threads that run on the (typically) two CPUs of a node, with one thread responsible for state generation and the other for efficient communication, including (1) performing overlapped asynchronous message passing and (2) aggregating the states to be sent into larger chunks to improve communication network utilization. We present the design details of such a novel model checking architecture, called Eddy. We describe the design rationale and the details of how the threads interact, yield control, exchange messages, and detect termination. We have realized an instance of this architecture, called Eddy_Murphi, for the Murphi modeling language, and report its performance as a function of the number of nodes as well as communication parameters such as those controlling state aggregation. A nearly linear reduction of compute time with increasing numbers of nodes is observed. Our thread task partition is modular, easy to port across different modeling languages, and easy to tune across a variety of platforms. Supported in part by NSF award CNS-0509379 and SRC Contract 2005-TJ-1318.
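To make the worker/communication split concrete, here is an illustrative C++ sketch of the aggregation idea — a communication thread draining states produced by the worker and shipping them in fixed-size chunks. This is not Eddy's code; the `CHUNK` size is assumed, and `send_chunk` stands in for an asynchronous MPI send.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative sketch (not Eddy's actual code): the communication thread
// batches states into chunks, trading latency for network utilization.
struct State { int data; };
constexpr size_t CHUNK = 1024;           // assumed aggregation size

std::queue<State> outbox;
std::mutex m;
std::condition_variable cv;
bool done = false;

void send_chunk(const std::vector<State>& chunk) {
    // Placeholder for an overlapped asynchronous send (e.g., MPI_Isend).
    std::printf("sending chunk of %zu states\n", chunk.size());
}

void comm_thread() {
    std::vector<State> chunk;
    std::unique_lock<std::mutex> lk(m);
    for (;;) {
        cv.wait(lk, [] { return !outbox.empty() || done; });
        while (!outbox.empty()) {
            chunk.push_back(outbox.front());
            outbox.pop();
            if (chunk.size() == CHUNK) {   // full: release lock while sending
                lk.unlock();
                send_chunk(chunk);
                chunk.clear();
                lk.lock();
            }
        }
        if (done) { if (!chunk.empty()) send_chunk(chunk); return; }
    }
}

int main() {
    std::thread t(comm_thread);
    for (int i = 0; i < 3000; ++i) {       // the "state generation" thread
        { std::lock_guard<std::mutex> g(m); outbox.push({i}); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> g(m); done = true; }
    cv.notify_one();
    t.join();
}
```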

8.
Modern high-energy physics research relies on high-energy particle accelerators, making beam-dynamics simulation software of great practical importance. This paper reports the development of NEWBEAM-MIC, a 3D heterogeneous parallel beam-dynamics simulation code for linear accelerators based on MIC (Many Integrated Core) coprocessors. The goal is to exploit the latest heterogeneous supercomputers to improve the performance of beam-dynamics simulation and thus better support accelerator design and optimization. The software simulates particle motion in DTL and solenoid accelerator structures. NEWBEAM-MIC builds on the NEWBEAM-CPU code by offloading the particle-pushing stage to MIC cards, exploiting MIC multithreading to accelerate the computation. In tests on Tianhe-2 using 100 CPUs and 100 MICs, the software simulated 10^9 particles; the DTL field computation, the solenoid field computation, and particle pushing each ran more than 100 times faster than the CPU-only NEWBEAM code on 100 CPUs. Further exploiting the MIC cards' multithreading, with the MIC thread count at its maximum (224), NEWBEAM-MIC on 100 CPUs and 100 MICs is more than 10,000 times faster than single-threaded serial execution for the same problem size. This shows that the MIC-based heterogeneous code developed here effectively accelerates the original CPU code and realizes the potential performance of existing MIC-based heterogeneous supercomputers.

9.
In this paper we propose and evaluate a set of new strategies for the solution of three-dimensional separable elliptic problems on CPU–GPU platforms. The numerical solution of the systems of linear equations that arise when discretizing these operators often represents the most time-consuming part of larger simulation codes covering a variety of physical situations. Incompressible fluid flow, electromagnetic problems, heat transfer, and solid mechanics simulations are just a few examples of application areas that require efficient solution strategies for this class of problems. GPU computing has emerged as an attractive alternative to conventional CPUs for many scientific applications. High speedups over CPU implementations have been reported, and this trend is expected to continue with improved programming support and tighter CPU–GPU integration. These speedups by no means imply that CPU performance is no longer critical: the conventional CPU-control/GPU-compute pattern used in many applications wastes much of the CPU's computational power. Our proposed parallel implementation of a classical cyclic reduction algorithm, which tackles the large linear systems arising from the discretized form of the elliptic problem at hand, schedules computation on both the GPU and the CPUs in a cooperative way. The experimental results demonstrate the effectiveness of this approach.
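For readers unfamiliar with cyclic reduction, the C++ sketch below shows the serial numerics for a tridiagonal system of size n = 2^q - 1 (a 1-D model of the separable elliptic case; the paper's CPU–GPU scheduling is not shown). Each forward level is independent across i, which is what makes the method parallelizable.

```cpp
#include <cstdio>
#include <vector>

// Serial cyclic reduction for a tridiagonal system (sub a, diag b, super c,
// rhs d), n = 2^q - 1. Each forward step halves the active equations and is
// embarrassingly parallel across i; this sketch shows only the numerics.
std::vector<double> cyclic_reduction(std::vector<double> a, std::vector<double> b,
                                     std::vector<double> c, std::vector<double> d) {
    const int n = (int)b.size();
    std::vector<double> x(n);
    // Forward reduction: at stride s, eliminate the neighbors of equation i,
    // leaving it coupled to unknowns 2s away.
    for (int s = 1; s < n; s *= 2) {
        for (int i = 2 * s - 1; i < n; i += 2 * s) {
            double al = a[i] / b[i - s];
            double ar = (i + s < n) ? c[i] / b[i + s] : 0.0;
            b[i] -= al * c[i - s] + ((i + s < n) ? ar * a[i + s] : 0.0);
            d[i] -= al * d[i - s] + ((i + s < n) ? ar * d[i + s] : 0.0);
            a[i] = -al * a[i - s];
            c[i] = (i + s < n) ? -ar * c[i + s] : 0.0;
        }
    }
    // Back substitution, from the coarsest level down.
    for (int s = (n + 1) / 2; s >= 1; s /= 2) {
        for (int i = s - 1; i < n; i += 2 * s) {
            double left  = (i - s >= 0) ? a[i] * x[i - s] : 0.0;
            double right = (i + s < n) ? c[i] * x[i + s] : 0.0;
            x[i] = (d[i] - left - right) / b[i];
        }
    }
    return x;
}

int main() {
    // -x_{i-1} + 2 x_i - x_{i+1} = 1: the 1-D Poisson problem on 7 points.
    int n = 7;
    std::vector<double> a(n, -1.0), b(n, 2.0), c(n, -1.0), d(n, 1.0);
    for (double v : cyclic_reduction(a, b, c, d)) std::printf("%g ", v);
    std::printf("\n");
}
```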

10.
This paper deals with the problem of finding valid solutions to systems of polynomial constraints. Although several quite successful algorithms based on domain subdivision have been proposed for this problem, some major issues still demand further research. The prime obstacles in developing an efficient subdivision-based polynomial constraint solver are the exhaustive, albeit hierarchical, search of the zero-set in the parameter domain, which is computationally demanding, and limited scalability in the number of variables. In this paper, we present a hybrid parallel algorithm for solving systems of multivariate constraints that exploits both the CPU and the GPU multicore architectures. We dedicate the CPU to the traversal of the subdivision tree and the GPU to the multivariate polynomial subdivision. By decomposing the constraint-solving technique into two components, hierarchy traversal and polynomial subdivision, each of which is better suited to CPUs and GPUs respectively, our solver can fully exploit hybrid, multicore CPU–GPU architectures. Furthermore, our GPU-based subdivision method takes advantage of the inherent parallelism in multivariate polynomial subdivision. We demonstrate the efficacy and scalability of the proposed parallel solver through several examples in geometric applications, including Hausdorff distance queries, contact point computations, surface–surface intersections, ray trap constructions, and bisector surface computations. In our experiments, the proposed parallel method achieves up to two orders of magnitude improvement in performance over the state-of-the-art subdivision-based CPU solver.

11.
Many-core accelerators are being deployed more frequently to improve system processing capabilities. In such systems, application mapping must be enhanced to maximize utilization of the underlying architecture. Especially in graphics processing units (GPUs), mapping the kernels that make up multi-kernel applications has a great impact on overall performance, since kernels may exhibit different characteristics on different CPUs and GPUs. While some kernels run faster on GPUs, others perform better on CPUs, so heterogeneous execution may yield better performance than executing the application only on a CPU or only on a GPU. In this paper, we investigate two approaches: a novel profiling-based adaptive kernel mapping algorithm that assigns each kernel of an application to the proper device, and a Mixed-Integer Programming (MIP) implementation that determines the optimal mapping. We utilize profiling information for kernels on different devices and generate a map that identifies which kernel should run where in order to improve the overall performance of an application. Initial experiments show that our approach can efficiently map kernels onto CPUs and GPUs and outperforms CPU-only and GPU-only approaches.

12.
Data prefetching mechanisms are widely used to hide memory latency in data-intensive applications. They mask the speed gap between CPUs and their memory systems by preloading data into the CPU caches, where accessing it is at least an order of magnitude faster. Pre-execution is a combined prefetching method that executes a slice of the original code, preloading the code and its data at the same time. Pre-execution is often mentioned in the literature but, to our knowledge, has not yet been formally defined. We fill this void by presenting formal definitions of speculative and non-speculative pre-execution, and we derive a lightweight software-based strategy that accelerates the main working thread by introducing an adaptive, non-speculative pre-execution helper thread. This helper thread acts as a perfect predictor: it calculates memory addresses, prefetches the data, and takes cache misses early. Adaptive automatic control allows the helper thread to configure itself at run time for best performance. The method is directly applicable to any data-intensive application without requiring hardware modifications, and it achieved an average speedup of 10–30% in a real-life application.
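The abstract does not describe its adaptive control loop, so the sketch below is purely a guess at one plausible form: the helper thread periodically widens or narrows its prefetch distance based on two hypothetical feedback counters (prefetches issued too late versus prefetched lines evicted unused). All names, thresholds, and bounds here are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical run-time self-configuration for a pre-execution helper
// thread: adjust the prefetch distance from observed feedback counters.
struct AdaptiveDistance {
    int distance = 32;                       // current run-ahead, in elements
    void tune(int late, int unused_evicted) {
        if (late > unused_evicted)
            distance = std::min(distance * 2, 512);  // too slow: run further ahead
        else if (unused_evicted > late)
            distance = std::max(distance / 2, 8);    // polluting: pull back
        // counters roughly balanced: keep the current distance
    }
};

int main() {
    AdaptiveDistance ctl;
    // Simulated feedback from three measurement epochs.
    int late[]    = {40, 5, 12};
    int evicted[] = {3, 60, 11};
    for (int e = 0; e < 3; ++e) {
        ctl.tune(late[e], evicted[e]);
        std::printf("epoch %d: distance = %d\n", e, ctl.distance);
    }
}
```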

13.
In a nuclear power plant, the pressure signals of the essential service water and circulating water filtering systems (important cooling sources) are acquired by the two redundant CPU trains in the safety logic processing cabinet. Both CPU trains sample and compute on a fixed 100 ms cycle, but the two trains run independently and are not synchronized, and each CPU samples at a fixed point within each cycle. When a pressure fluctuation caused by the water-hammer effect produces a signal shorter than 100 ms that is captured by only one CPU train, safety-class equipment such as motors and pumps is immediately and unexpectedly started. If the logic states of the two CPU trains remain inconsistent for 40 s, a two-train CPU inconsistency fault alarm is also raised. Based on this analysis, a 0.5 s rising-edge delay module was added in front of the CPU signal acquisition. The results show that this resolves the problem of transient (flash) signals, caused by water-hammer pressure fluctuations, triggering two-train CPU inconsistency alarms and frequent unexpected starts of safety-class motors, pumps, and other equipment.
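A rising-edge delay is simple to state in code: the output asserts only after the raw input has stayed high for the delay period, so spikes shorter than 100 ms never propagate. The C++ sketch below uses the 0.5 s figure from the abstract; the sampling framework around it is hypothetical.

```cpp
#include <chrono>
#include <cstdio>

// Sketch of a 0.5 s rising-edge delay filter: the output asserts only after
// the raw input has remained high for DELAY, so sub-100 ms water-hammer
// spikes are suppressed before they reach the safety logic.
using Clock = std::chrono::steady_clock;

class RisingEdgeDelay {
    static constexpr auto DELAY = std::chrono::milliseconds(500);
    bool prev_ = false;
    Clock::time_point rise_{};
public:
    bool update(bool raw, Clock::time_point now) {
        if (raw && !prev_) rise_ = now;       // rising edge: start the timer
        prev_ = raw;
        return raw && (now - rise_ >= DELAY); // high long enough to pass
    }
};

int main() {
    RisingEdgeDelay filt;
    auto t0 = Clock::now();
    // An 80 ms spike (suppressed) followed by a sustained signal (passed).
    struct { int ms; bool raw; } samples[] = {
        {0, true}, {80, false}, {200, true}, {400, true}, {800, true}
    };
    for (auto s : samples)
        std::printf("t=%dms raw=%d out=%d\n", s.ms, s.raw,
                    filt.update(s.raw, t0 + std::chrono::milliseconds(s.ms)));
}
```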

14.
In light of GPUs' powerful floating-point capability, heterogeneous parallel systems incorporating general-purpose CPUs and GPUs have become a highlight in the research field of high-performance computing (HPC). However, due to the complexity of programming GPUs, porting the large number of existing scientific computing applications to such heterogeneous parallel systems remains a big challenge. The OpenMP programming interface is widely adopted on multi-core CPUs in scientific computing. To reuse existing OpenMP applications effectively and reduce porting cost, we extend OpenMP with a group of compiler directives that explicitly divide tasks between the CPU and the GPU and map time-consuming computing fragments to run on the GPU, dramatically simplifying the port. We have designed and implemented MPtoStream, a compiler for the extended OpenMP targeting AMD's stream-processing GPUs. Our experimental results show that programming with the extended directives requires less than 11% modification relative to plain OpenMP and achieves significant speedups, ranging from 3.1 to 17.3, on a heterogeneous system incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU, over execution on the Xeon CPU alone.
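The abstract does not list MPtoStream's directive syntax, so the sketch below uses today's standard OpenMP 4.5+ `target` directives to show the same idea: annotating a time-consuming loop so the compiler maps it onto the GPU while the rest of the program stays ordinary CPU code. This is an analogy, not the paper's extension.

```cpp
#include <cstdio>

// Standard OpenMP offload directives as an analogy for MPtoStream's extended
// directives: the hot loop is mapped to the accelerator, with explicit data
// movement for the arrays it touches.
int main() {
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20], c[1 << 20];
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Offload the time-consuming fragment to the GPU.
    #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("c[0] = %f\n", c[0]);
}
```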

15.
The State of the Art in Cache Management Techniques for Multi-core Processor Systems
Cache structure design and management for multi-core processors is an important problem in microprocessor design. Current mainstream commercial microprocessors adopt an architecture with a shared last-level cache (LLC), and the performance of the on-chip LLC usually has a large impact on overall processor performance, so shared-cache management has become a research hotspot. This paper first introduces current mainstream multi-core processors and their design issues, then describes three key techniques for shared-cache management: thread scheduling, NUCA, and cache partitioning. Finally, it discusses future directions for multi-core cache management.
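Of the three techniques the survey covers, cache partitioning is the easiest to illustrate in code. Below is a minimal C++ sketch of the classic utility-based greedy allocation: given per-core miss curves (misses as a function of allocated ways, e.g., from shadow-tag monitors), ways are handed out one at a time to whichever core gains the most. The miss numbers are made up for illustration.

```cpp
#include <cstdio>
#include <vector>

// Utility-based cache partitioning sketch: greedily allocate LLC ways by
// marginal miss reduction.
int main() {
    const int WAYS = 8;
    // misses[core][w] = misses if the core owns w ways (w = 0..WAYS).
    std::vector<std::vector<long>> misses = {
        {900, 500, 300, 220, 200, 195, 193, 192, 192},  // cache-sensitive core
        {400, 390, 385, 383, 382, 381, 381, 381, 381},  // streaming core
    };
    std::vector<int> alloc(misses.size(), 0);
    for (int w = 0; w < WAYS; ++w) {          // give out one way at a time
        int best = 0; long best_gain = -1;
        for (size_t c = 0; c < misses.size(); ++c) {
            long gain = misses[c][alloc[c]] - misses[c][alloc[c] + 1];
            if (gain > best_gain) { best_gain = gain; best = (int)c; }
        }
        alloc[best]++;                        // cache-sensitive core wins most ways
    }
    for (size_t c = 0; c < misses.size(); ++c)
        std::printf("core %zu gets %d ways\n", c, alloc[c]);
}
```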

16.
Research on QoS Guarantees for Multimedia Streams Based on an Adaptive Regulator
张占军, 韩承德, 杨学良. 《计算机学报》 (Chinese Journal of Computers), 2000, 23(12): 1320-1325
A key problem in distributed multimedia computer systems is guaranteeing the quality of service (QoS) of continuous multimedia streams. For most multimedia streams, the required QoS lies within a range, [QoSmin, QoSmax]. To guarantee QoS, the computer and network systems must allocate sufficient CPU, I/O, memory, and network bandwidth resources. During multimedia stream transmission, the availability of these resources varies dynamically, especially network bandwidth (as on the Internet). This paper proposes the concept of an adaptive regulator that uses PID control to adjust QoS dynamically within [QoSmin, QoSmax], adapting to changes in the dynamically available resources. This both guarantees the QoS of the multimedia streams and improves resource utilization.
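A generic PID regulator of the kind the abstract applies to QoS looks like the C++ sketch below: the controlled variable (here, a hypothetical frame rate) is steered toward a setpoint and clamped to the negotiated [QoSmin, QoSmax] range. The gains and numbers are illustrative, not from the paper.

```cpp
#include <algorithm>
#include <cstdio>

// PID controller steering a QoS variable toward a setpoint, with the output
// clamped to the negotiated [QoSmin, QoSmax] range.
struct Pid {
    double kp, ki, kd;
    double integral = 0.0, prev_err = 0.0;
    double step(double setpoint, double measured, double dt) {
        double err = setpoint - measured;
        integral += err * dt;
        double deriv = (err - prev_err) / dt;
        prev_err = err;
        return kp * err + ki * integral + kd * deriv;
    }
};

int main() {
    const double qos_min = 10.0, qos_max = 30.0;   // e.g., frames per second
    Pid pid{0.5, 0.1, 0.05};
    double target = 25.0, actual = 12.0;
    for (int t = 0; t < 10; ++t) {
        double adjust = pid.step(target, actual, 1.0);
        // Apply the correction, then clamp to the negotiated QoS range.
        actual = std::clamp(actual + adjust, qos_min, qos_max);
        std::printf("t=%d qos=%.2f\n", t, actual);
    }
}
```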

17.
The parallel preconditioned conjugate gradient method (CGM) is used in many applications of scientific computing and often has a critical impact on their performance and energy consumption. This article investigates the energy-aware execution of the CGM on multi-core CPUs and GPUs used in an adaptive FEM. Based on experiments, an application-specific execution time and energy model is developed. The model considers the execution speed of the CPU and the GPU, their electrical power, voltage and frequency scaling, the energy consumption of the memory as well as the time and energy needed for transferring the data between main memory and GPU memory. The model makes it possible to predict how to distribute the data to the processing units for achieving the most energy efficient execution: the execution might deploy the CPU only, the GPU only or both simultaneously using a dynamic and adaptive collaboration scheme. The dynamic collaboration enables an execution minimising the execution time. By measuring execution times for every FEM iteration, the data distribution is adapted automatically to changing properties, e.g. the data sizes.
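The rebalancing step in such a dynamic collaboration scheme can be sketched simply: after each iteration, measure per-device times and set the next split proportional to measured throughput so both devices finish together. The C++ below is an illustration under that assumption; the `run_cpu`/`run_gpu` stand-ins and the timing numbers are hypothetical.

```cpp
#include <cstdio>

// Sketch of per-iteration CPU/GPU load rebalancing from measured times.
// The run_* functions stand in for the real kernels and return elapsed time.
double run_cpu(long elems) { return elems / 2.0e6; }   // pretend: 2 Melem/s
double run_gpu(long elems) { return elems / 1.6e7; }   // pretend: 16 Melem/s

int main() {
    const long total = 10'000'000;
    double frac_gpu = 0.5;                  // initial guess: even split
    for (int iter = 0; iter < 5; ++iter) {
        long gpu_elems = (long)(frac_gpu * total);
        double t_gpu = run_gpu(gpu_elems);
        double t_cpu = run_cpu(total - gpu_elems);
        // Rebalance: next split proportional to measured throughput, so
        // both devices would finish in (nearly) the same time.
        double thr_gpu = gpu_elems / t_gpu;
        double thr_cpu = (total - gpu_elems) / t_cpu;
        frac_gpu = thr_gpu / (thr_gpu + thr_cpu);
        std::printf("iter %d: t_cpu=%.3fs t_gpu=%.3fs -> gpu share %.2f\n",
                    iter, t_cpu, t_gpu, frac_gpu);
    }
}
```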

18.
It is expected that chip multiprocessor systems (CMPs) will contain more and more cores with every new generation. However, applications for these systems do not scale at the same pace; to obtain good CMP utilization, several applications will need to coexist in the system, and in those cases virtualization of the CMP becomes mandatory. In this paper we analyze two virtualization strategies at the NoC level that aim to isolate the traffic generated by each application, reducing or even eliminating interference among messages belonging to different applications. The first model handles most interference among messages with a virtual-channel (VC) implementation, reducing both execution time and network latency; however, using VCs incurs area and power overhead due to the cost of control and buffer implementation. The second model is based on resource partitioning strategies that divide the CMP chip spatially into several regions. For this second model, Virtual Regions (VR), we use a network reconfiguration algorithm that dynamically adapts the network partitions to the application requirements. The paper compares both models and identifies their main advantages and disadvantages. Our experimental results show that our proposal achieves average execution-time improvements of 30% for parallel applications compared to a baseline scenario. Moreover, compared to a VC implementation, our proposal improves average execution time by 9% for parallel applications.

19.
沈霆, 李明禄, 翁楚良. 《计算机工程》 (Computer Engineering), 2010, 36(20): 244-246
The Xen virtualization environment does not account for the impact of intermittent CPU faults. This paper therefore builds a timing model that simulates intermittent CPU faults; under this model, virtual machines on an unmodified Xen system crash immediately. An adaptive strategy is proposed to improve Xen's CPU scheduling: it actively tracks CPU state changes and migrates the virtual CPUs running on a faulty physical CPU to other available CPUs. Experimental results show that when intermittent CPU faults occur frequently, this strategy keeps virtual machines running stably, with performance degrading smoothly.

20.
Recent computer systems and handheld devices are equipped with high computing capability, such as general-purpose GPUs (GPGPUs) and multi-core CPUs. Utilizing such resources for computation has become a general trend, making their availability an important issue for real-time processing. Discrete cosine transform (DCT) and quantization are two major operations in image compression standards that require complex computations. In this paper, we develop an efficient parallel implementation of the forward DCT and quantization algorithms for JPEG image compression using the Open Computing Language (OpenCL). This OpenCL-based parallel implementation utilizes a multi-core CPU and a GPGPU to perform the DCT and quantization computations, and we demonstrate the capability of the design via two proposed working scenarios. The proposed approach also applies optimization techniques to improve kernel execution time and data movement: we developed an optimal OpenCL kernel for a particular device using device-based optimization factors such as thread granularity, work-item mapping, workload allocation, and vector-based memory access. Evaluated in a heterogeneous environment, the proposed parallel implementation speeds up the execution of the DCT and quantization by factors of 7.97 and 8.65, respectively, for 1024 × 1024 and 2048 × 2048 images in 4:4:4 format.
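For reference, the math being offloaded is the standard JPEG 8×8 forward DCT-II followed by quantization. The scalar C++ below pins down that computation (an OpenCL kernel would assign output coefficients to work-items); the block data and the flat quantizer value are illustrative, not the paper's tables.

```cpp
#include <cmath>
#include <cstdio>

// Reference (host-side) version of the two operations the paper offloads:
// the 8x8 forward DCT-II and quantization used in JPEG.
constexpr int B = 8;

void dct8x8(const float in[B][B], float out[B][B]) {
    const float pi = 3.14159265358979f;
    for (int u = 0; u < B; ++u)
        for (int v = 0; v < B; ++v) {
            float cu = (u == 0) ? 1.0f / std::sqrt(2.0f) : 1.0f;
            float cv = (v == 0) ? 1.0f / std::sqrt(2.0f) : 1.0f;
            float s = 0.0f;
            for (int x = 0; x < B; ++x)
                for (int y = 0; y < B; ++y)
                    s += in[x][y]
                       * std::cos((2 * x + 1) * u * pi / (2 * B))
                       * std::cos((2 * y + 1) * v * pi / (2 * B));
            out[u][v] = 0.25f * cu * cv * s;
        }
}

int main() {
    float block[B][B], coef[B][B];
    for (int x = 0; x < B; ++x)
        for (int y = 0; y < B; ++y)
            block[x][y] = (float)((x + y) % 16) - 8.0f;  // toy block data
    dct8x8(block, coef);
    int q = 16;                                          // flat quantizer
    for (int u = 0; u < B; ++u) {
        for (int v = 0; v < B; ++v)
            std::printf("%4d ", (int)std::lround(coef[u][v] / q));
        std::printf("\n");
    }
}
```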
