Similar Documents
20 similar documents retrieved.
1.
Current Xen virtual machine scheduling algorithms schedule each virtual CPU independently and do not consider the co-scheduling relationships among a virtual machine's virtual CPUs. This can produce large discrepancies in the number of clock interrupts received by the different virtual CPUs of a VM, among other problems, and makes the system unstable. To improve system stability, this paper proposes MRCS (more relaxed co-scheduling), a co-scheduling algorithm built on the Credit scheduler that is even more relaxed than the RCS (relaxed co-scheduling) algorithm. MRCS uses a non-preemptive coordination method to keep the relative running-time gap between virtual CPUs within the upper synchronization-detection threshold Tmax. Combined with an optimized policy for selecting virtual CPUs from the synchronization queue and the Credit scheduler's dynamic virtual CPU migration, it co-schedules virtual CPUs more promptly, keeps the physical CPUs load-balanced, and effectively reduces the number of context switches between the guest operating system and the VMM, thereby lowering system overhead. Experimental results show that the method not only preserves system stability but also improves system performance to a certain extent. Since virtual machine scheduling algorithms affect both VM performance and VM stability, research on them is highly worthwhile.
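A minimal sketch of the kind of synchronization check the abstract describes: find the vCPUs that lag the leader by more than Tmax and mark them for coordinated scheduling. All names, the Tmax value, and the data structures are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical MRCS-style synchronization check: keep the gap between the
# most-run and least-run vCPUs of a VM within Tmax (non-preemptively, by
# marking laggards for preferred scheduling rather than interrupting leaders).

from dataclasses import dataclass

T_MAX_MS = 30.0  # assumed upper synchronization-detection threshold (Tmax)

@dataclass
class VCPU:
    vcpu_id: int
    runtime_ms: float = 0.0      # accumulated running time
    in_sync_queue: bool = False

def check_sync(vcpus: list[VCPU]) -> list[VCPU]:
    """Return the vCPUs that lag too far behind and should be co-scheduled next."""
    leader = max(v.runtime_ms for v in vcpus)
    laggards = [v for v in vcpus if leader - v.runtime_ms > T_MAX_MS]
    for v in laggards:
        v.in_sync_queue = True
    return laggards

if __name__ == "__main__":
    vm = [VCPU(0, 120.0), VCPU(1, 80.0), VCPU(2, 118.0)]
    print([v.vcpu_id for v in check_sync(vm)])  # vCPU 1 lags by 40 ms > Tmax
```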

2.
Speech compression or speech coding is inevitable for effective communication of speech signals in resource-limited scenarios, and researchers have been working to achieve ever lower transmission bit rates (BR) without much compromise on the quality of speech. Medium-BR hybrid speech coding schemes have gained much interest in recent years, most of them based on CELP, the basic medium bit-rate coding scheme. In this work, we provide an insight into the capabilities of compressive sensing (CS) in speech processing and propose a novel idea in the quantized framework. Three major aspects demonstrated in this paper are (1) inherent de-noising of noisy speech by the CS-based coder along with compression, (2) quantization of CS measurements to achieve medium transmission bit rates, and (3) enhancement of the quality and compression performance of the coder with better sparse representations of speech using dictionaries. The results indicate that the proposed scheme offers better compression in comparison with basic Gaussian-codebook CELP. The CS scheme has the added advantage of inherent noise suppression and is more robust to background noise than parameter-extraction-based medium bit-rate speech coding systems.
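To make the measure-and-quantize idea concrete, here is a generic sketch of compressive sensing applied to one speech frame followed by uniform quantization of the measurements. It illustrates the concept only; the frame length, measurement count, bit depth, and random Gaussian measurement matrix are assumptions, not the authors' coder.

```python
# Generic CS measure-and-quantize sketch (not the paper's scheme): project a
# speech frame onto a random Gaussian matrix, then uniformly quantize the
# resulting measurements for transmission.

import numpy as np

def cs_encode_frame(frame, m, n_bits=6, seed=0):
    """Compress one length-n speech frame into m uniformly quantized CS measurements."""
    n = frame.size
    rng = np.random.default_rng(seed)             # shared seed stands in for a codebook
    phi = rng.standard_normal((m, n)) / np.sqrt(m)
    y = phi @ frame                               # m << n compressive measurements
    step = max((y.max() - y.min()) / (2 ** n_bits), 1e-12)
    indices = np.round((y - y.min()) / step).astype(int)
    return indices, float(y.min()), step          # what a coder would transmit

if __name__ == "__main__":
    t = np.arange(160)                            # toy 20 ms frame at 8 kHz
    frame = np.sin(2 * np.pi * 0.05 * t) + 0.3 * np.sin(2 * np.pi * 0.12 * t)
    idx, offset, step = cs_encode_frame(frame, m=40)
    print(idx.shape, round(step, 4))
```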

3.
The growing demand for high-speed wireless data services requires next-generation wireless networks to raise their throughput significantly. This paper studies automatic repeat request (ARQ) mechanisms to meet these new requirements. A unified framework is proposed for the rate-adaptive type-I hybrid ARQ scheme based on rate-compatible error-correcting codes and the incremental-redundancy retransmission type-II hybrid ARQ scheme, and their performance over wireless Rayleigh fading channels is analyzed. Numerical results show that the incremental-redundancy retransmission (IRR) type-II HARQ scheme achieves higher throughput than the rate-adaptive (RA) type-I HARQ scheme, while the RA type-I HARQ scheme has lower delay.

4.
The lattice Boltzmann method (LBM) is a widely used computational fluid dynamics method for flow problems with complex geometries and various boundary conditions. Large-scale LBM simulations with increasing resolution and extending temporal range require massive high-performance computing (HPC) resources, motivating us to port the method onto modern many-core heterogeneous supercomputers like Tianhe-2. Although many-core accelerators such as graphics processing units and the Intel MIC have a dramatic advantage in floating-point performance and power efficiency over CPUs, they also pose a tough challenge for parallelizing and optimizing computational fluid dynamics codes on large-scale heterogeneous systems. In this paper, we parallelize and optimize the open-source 3D multi-phase LBM code openlbmflow on the Intel Xeon Phi (MIC) accelerated Tianhe-2 supercomputer using a hybrid and heterogeneous MPI+OpenMP+Offload+single instruction, multiple data (SIMD) programming model. With cache blocking and a SIMD-friendly data structure transformation, we dramatically improve the SIMD and cache efficiency of the single-thread performance on both the CPU and the Phi, achieving speedups of 7.9X and 8.8X, respectively, compared with the baseline code. To make the CPUs and Phi processors cooperate efficiently, we propose a load-balance scheme to distribute workloads among the two CPUs and three Phi processors within a node, and use an asynchronous model to overlap the collaborative computation and communication as far as possible. The collaborative approach with two CPUs and three Phi processors improves performance by around 3.2X compared with the CPU-only approach. Scalability tests show that openlbmflow can achieve a parallel efficiency of about 60% on 2048 nodes, with about 400K cores in total. To the best of our knowledge, this is the largest-scale CPU-MIC collaborative LBM simulation for 3D multi-phase flow problems.
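A tiny sketch of one plausible way to realize the kind of intra-node load-balance split described above: divide lattice rows among two CPUs and three Phi cards in proportion to their relative throughput. The device names and throughput numbers are made-up assumptions, not the paper's measured values or actual scheme.

```python
# Hypothetical static load split proportional to per-device throughput.

def partition_rows(total_rows, throughputs):
    """Assign a contiguous block of lattice rows to each device, proportional to its speed."""
    total = sum(throughputs)
    blocks, start = [], 0
    for i, tp in enumerate(throughputs):
        if i < len(throughputs) - 1:
            size = round(total_rows * tp / total)
        else:
            size = total_rows - start            # last device absorbs the rounding remainder
        blocks.append((start, start + size))
        start += size
    return blocks

if __name__ == "__main__":
    devices = ["cpu0", "cpu1", "mic0", "mic1", "mic2"]
    mlups = [55.0, 55.0, 90.0, 90.0, 90.0]       # assumed relative throughputs, for illustration only
    for dev, (lo, hi) in zip(devices, partition_rows(1024, mlups)):
        print(dev, lo, hi)
```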

5.
Nowadays, it is an important trend in the systems domain to use software-based virtualization technology to build execution environments (e.g., clouds). After introducing the virtualization layer, there exist two schedulers: one in the hypervisor and the other inside the Guest Operating System (GOS). To fully understand the virtualized system and identify the possible reasons for performance problems incurred by the virtualization technology, it is very important for system administrators and engineers to know the scheduling behavior of the hypervisor, in addition to understanding the scheduler inside the GOS. In this paper, we develop a virtualization scheduling analyzer, called VSA, to analyze the trace data of the Xen virtual machine monitor. With VSA, one can easily obtain the scheduling data associated with virtual processors (i.e., VCPUs) and physical processors (i.e., PCPUs), and further conduct scheduling analysis for a group of interacting VCPUs running in the same domain.

6.
High-performance parallel and scientific applications are composed of multiple processes running on distinct CPUs that communicate frequently. Due to the synchronization needs of such applications, performance is greatly hampered if their processes are not scheduled simultaneously on the CPUs. Implicit coscheduling (ICS) is a well-known technique to address this problem in multi-programmed clusters; however, traditional ICS schemes do not incorporate steps to adequately deal with priority boost conflicts, leading to significantly degraded performance. In this paper, we propose the use of runtime differences in contention across nodes to provide more sophisticated coscheduling decisions in response to the conflicts. We also present a novel coscheduling scheme termed PROC (Process ReOrdering-based Coscheduling) that adaptively regulates the scheduling sequence of conflicting processes based on the rescheduling latency of their correspondents in remote nodes. We perform extensive simulation-based experiments using both synthetic and realistic workloads to analyze the performance of PROC compared to alternatives such as local scheduling, a widely used batch scheduling, gang scheduling, and existing ICS schemes. The results show that all ICS schemes commonly experience priority boost conflicts, and that the proposed PROC significantly outperforms other ICS alternatives (or batch scheduling) by up to 50.4% (or 72.5%) in average job response time. This improvement is achieved by reducing wasted idle time and spinning time without sacrificing fairness.
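The core reordering idea can be sketched in a few lines: among conflicting boosted processes, run first the one whose communication partner on the remote node will be rescheduled soonest. This is a hypothetical illustration of the principle only; the class, field names, and latency estimates are not PROC's actual data structures.

```python
# Illustrative reordering of conflicting processes by their correspondents'
# rescheduling latency on remote nodes (PROC-style principle, not its code).

from dataclasses import dataclass

@dataclass
class Proc:
    pid: int
    remote_resched_latency_ms: float  # estimated wait until its correspondent runs remotely

def reorder_conflicting(procs: list[Proc]) -> list[Proc]:
    """Order conflicting boosted processes so the soonest-matched one runs first."""
    return sorted(procs, key=lambda p: p.remote_resched_latency_ms)

if __name__ == "__main__":
    runnable = [Proc(11, 4.0), Proc(12, 0.5), Proc(13, 2.0)]
    print([p.pid for p in reorder_conflicting(runnable)])  # [12, 13, 11]
```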

7.
Hybrid CPU/GPU clusters have recently drawn a lot of attention in high-performance computing because of their excellent execution performance and energy efficiency. Many supercomputing sites in the newest TOP500 and Green500 lists are built from hybrid CPU/GPU clusters instead of CPU clusters. However, the programming complexity of hybrid CPU/GPU clusters is so high that most users hesitate to move to this new cluster computing platform. To resolve this problem, we propose a distributed PTX virtual machine called BigGPU for heterogeneous clusters. As its name suggests, this virtual machine is physically a distributed system aimed at re-compiling and executing PTX code in parallel by aggregating the CPUs and GPUs available in a computational cluster. With the support of this virtual machine, users can regard a hybrid CPU/GPU cluster as a single large-scale GPU. Consequently, they can develop applications using only CUDA, without combining MPI and multithreading APIs, while simultaneously using distributed CPUs and GPUs to solve the same problem. Moreover, they need not handle load balancing among heterogeneous processors or the constraints on device memory and thread configuration that exist in physical GPUs, because BigGPU supports a large-scale virtual device memory space and thread configuration. We also evaluate the execution performance of BigGPU. Our experimental results show that BigGPU can indeed effectively exploit the computational power of CPUs and GPUs to enhance the execution performance of users' CUDA programs.

8.
Distributed shared memory has increasingly become a desirable model for programming multicomputer systems. Such systems strike a balance between the performance attainable in distributed-memory multiprocessors and the ease of programming on shared-memory systems. In shared-memory systems, concurrent tasks communicate through shared variables, and synchronization of access to shared data is an important issue. Semaphores have traditionally been used to provide this synchronization. In this paper we propose a decentralized scheme to support semaphores in a virtual shared-memory system. Our method of grouping semaphores into semaphore pages and caching a semaphore at a processor on demand eliminates the reliability problems and bottlenecks associated with centralized schemes. We compare the performance of our scheme with a centralized implementation of semaphores and conclude that our system performs better under high semaphore access rates as well as with larger numbers of processors.
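The sketch below illustrates the grouping-and-caching idea described above: semaphores live on pages, and a whole page is cached at the node that needs one of its semaphores. It is only a toy single-process model under assumed names; the Directory class stands in for whatever ownership-tracking service a real decentralized implementation would use.

```python
# Toy illustration of semaphore pages cached on demand (not the paper's protocol).

SEMS_PER_PAGE = 64

class Directory:
    """Toy in-memory stand-in for the page-ownership directory."""
    def __init__(self):
        self.pages = {}
    def fetch_page(self, page_id, new_owner):
        # hand the page (its counters) over to the requesting node
        return self.pages.setdefault(page_id, {})

class SemaphorePageCache:
    def __init__(self, node_id, directory):
        self.node_id = node_id
        self.directory = directory
        self.local_pages = {}            # page_id -> {sem_id: counter}

    def _page_of(self, sem_id):
        return sem_id // SEMS_PER_PAGE

    def p(self, sem_id):
        """P (wait): succeeds locally if the cached counter is positive."""
        page_id = self._page_of(sem_id)
        if page_id not in self.local_pages:                   # page fault: cache on demand
            self.local_pages[page_id] = self.directory.fetch_page(page_id, self.node_id)
        page = self.local_pages[page_id]
        if page.get(sem_id, 1) > 0:                           # semaphores start at 1 here
            page[sem_id] = page.get(sem_id, 1) - 1
            return True
        return False                                          # caller would block or retry

    def v(self, sem_id):
        """V (signal): increment the cached counter."""
        page = self.local_pages[self._page_of(sem_id)]
        page[sem_id] = page.get(sem_id, 1) + 1

if __name__ == "__main__":
    node = SemaphorePageCache(node_id=0, directory=Directory())
    print(node.p(7), node.p(7))   # True, False: the second P would block until a V
    node.v(7)
    print(node.p(7))              # True again
```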

9.
In many environments, rather than minimizing message latency or maximizing network performance, the ability to survive the failure of individual network components is the main issue of interest. The nature of Wormhole Switching (WS) leads to high network throughput and low message latencies. However, in the vicinity of faulty regions these behaviors cause rapid congestion, causing the network to become deadlocked. While techniques such as adaptive routing can alleviate the problem, they cannot completely solve it. Thus, there have been extensive studies on other types of switching mechanisms in the networking and multicomputer communities. In this paper, we present a general mathematical model to assess the relative performance merits of three well-known fault-tolerant switching methods in tori, namely Scouting Switching (SS), Pipelined Circuit Switching (PCS), and Circuit Switching (CS). We have carried out extensive simulation experiments, the results of which are used to validate the proposed analytical models. We have also conducted an extensive comparative performance analysis, by means of analytical modeling, of SS, PCS, and CS under various working conditions. The analytical results reveal that SS shows substantial performance improvements over PCS and CS for low to moderate failure rates, achieving performance close to that of WS. PCS can provide superior performance over CS, and performs the same as, or on some occasions worse than, SS under light and moderate traffic, especially with the same hardware requirements.

10.
Multicore processors are widely used in today's computer systems. Multicore virtualization technology provides an elastic solution to utilize multicore systems more efficiently. However, the Lock Holder Preemption (LHP) problem in virtualized multicore systems wastes significant CPU cycles, which hurts virtual machine (VM) performance and increases response latency. The more VMs the system consolidates, the worse the LHP problem becomes. In this paper, we propose an efficient consolidation-aware vCPU (CVS) scheduling scheme for multicore virtualization platforms. Because the actions of vCPU scheduling decompose into single steps such as scheduling vCPUs simultaneously or inserting one vCPU into the run queue from the head or the tail, the CVS scheme adaptively selects one of three vCPU scheduling algorithms (co-scheduling, yield-to-head, or yield-to-tail) based on the vCPU over-commitment rate. The CVS scheme can effectively improve VM performance in low, middle, and high VM consolidation scenarios. Using real-life parallel benchmarks, our experimental results show that the proposed CVS scheme improves overall system performance while the optimization overhead remains low.
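The selection step lends itself to a compact sketch: map the vCPU over-commitment rate to one of the three scheduling actions. The threshold values and names below are assumptions chosen for illustration, not the values or code used by the CVS scheme.

```python
# Hypothetical CVS-style policy selection based on the vCPU over-commitment rate.

from enum import Enum, auto

class Policy(Enum):
    CO_SCHEDULING = auto()   # schedule sibling vCPUs simultaneously
    YIELD_TO_HEAD = auto()   # reinsert a yielding vCPU at the run-queue head
    YIELD_TO_TAIL = auto()   # reinsert a yielding vCPU at the run-queue tail

def select_policy(total_vcpus: int, physical_cpus: int) -> Policy:
    overcommit = total_vcpus / physical_cpus
    if overcommit <= 1.5:        # low consolidation: co-scheduling is affordable
        return Policy.CO_SCHEDULING
    if overcommit <= 3.0:        # medium consolidation
        return Policy.YIELD_TO_HEAD
    return Policy.YIELD_TO_TAIL  # high consolidation: keep run queues fair

if __name__ == "__main__":
    print(select_policy(total_vcpus=16, physical_cpus=8))  # Policy.YIELD_TO_HEAD
```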

11.
In this paper, we present hybrid weighted essentially non-oscillatory (WENO) schemes with several discontinuity detectors for solving the compressible ideal magnetohydrodynamics (MHD) equations. Li and Qiu (J Comput Phys 229:8105–8129, 2010) examined the effectiveness and efficiency of several different troubled-cell indicators in hybrid WENO methods for Euler gas dynamics. Later, Li et al. (J Sci Comput 51:527–559, 2012) extended the hybrid methods to the shallow water equations with four better indicators. Hybrid WENO schemes reduce computational cost, maintain non-oscillatory properties, and keep sharp transitions. Numerical results for hybrid WENO-JS/WENO-M schemes are presented to compare the ability of several troubled-cell indicators on a variety of test problems. The focus of this paper is to propose optimal and reliable indicators for the hybrid method, yielding an efficient numerical method for the ideal MHD equations. We propose a modified ATV indicator that uses a second derivative; it is advantageous for detecting different kinds of discontinuities, such as jump discontinuities and kinks. A detailed numerical study of one-dimensional and two-dimensional cases is conducted to address efficiency (reduced CPU time and more accurate numerical solutions) and non-oscillatory properties. We demonstrate that the hybrid WENO-M scheme preserves the advantages of WENO-M, and that the ratio of the computational cost of hybrid WENO-M to hybrid WENO-JS is smaller than that of WENO-M to WENO-JS.
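A toy troubled-cell indicator in the spirit of the second-derivative idea mentioned above: flag cells whose discrete second difference is large relative to the average, so both jumps and kinks stand out. The threshold, scaling, and test function are illustrative assumptions and are not the paper's modified ATV indicator.

```python
# Toy 1D second-difference troubled-cell indicator (illustration only).

import numpy as np

def troubled_cells(u, alpha=2.5):
    """Return indices of interior cells flagged as 'troubled' on a uniform 1D grid."""
    d2 = np.abs(u[2:] - 2.0 * u[1:-1] + u[:-2])   # discrete second difference
    mean_d2 = d2.mean() + 1e-14
    return np.where(d2 > alpha * mean_d2)[0] + 1  # shift back to interior indexing

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 201)
    u = np.where(x < 0.5, np.sin(2 * np.pi * x), 1.0 + np.sin(2 * np.pi * x))  # jump at x = 0.5
    print(troubled_cells(u))   # cells around the discontinuity are flagged
```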

12.
This paper points out that in the improved perfect concurrent signature scheme, a signer can bind multiple messages to the same keystone while letting the other signers know only one of them, which is unfair to the signing parties. The paper proposes and defines traceability for perfect concurrent signatures, gives an attack instance against the traceability of the perfect concurrent signature scheme, and presents a corresponding revised scheme: the message to be signed is taken, together with the keystone, as an input to the keystone transfer function, so that the signed message is uniquely bound to the keystone and the revised scheme satisfies the traceability requirement.
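The binding idea in the revision can be illustrated in a few lines: derive the keystone transfer value from the keystone and the message together, so one keystone cannot silently cover several different messages. This is only a sketch of the binding step using SHA-256 as a stand-in hash; it is not the concurrent signature protocol itself.

```python
# Minimal illustration of binding the message to the keystone (not a full scheme).

import hashlib
import os

def keystone_fix(keystone: bytes, message: bytes) -> bytes:
    """Bind the message to the keystone: f = H(keystone || message)."""
    return hashlib.sha256(keystone + message).digest()

if __name__ == "__main__":
    k = os.urandom(32)                               # keystone chosen by the initial signer
    f1 = keystone_fix(k, b"pay Alice 10 coins")
    f2 = keystone_fix(k, b"pay Alice 1000 coins")
    print(f1 != f2)   # True: the same keystone yields a different fix for each message
```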

13.
A Genetic Algorithm (GA) is a heuristic for finding exact or approximate solutions to optimization and search problems within an acceptable time. We discuss GAs from an architectural perspective, offering a general analysis of GA performance on multi-core CPUs and on many-core GPUs. Based on the widely used Parallel GA (PGA) schemes, we propose the best scheme for each architecture. More specifically, the Asynchronous Island scheme, the Island/Master–Slave Hierarchy PGA, and the Island/Cellular Hierarchy PGA are the best for multi-core, multi-socket multi-core, and many-core architectures, respectively. Optimization approaches and rules based on a deep understanding of multi- and many-core architectures are also analyzed and proposed. Finally, the comparison of GA performance on multi-core and many-core architectures is discussed. Three real GA problems are used as benchmarks to evaluate our analysis and findings. There are three contributions beyond previous work. First, because our findings come from a deep analysis of the architectures, they apply to GA problems in general, and even to other parallel computing workloads, rather than to a particular GA problem. Second, we evaluate GA performance not only by execution speed but also by solution quality, which has often not been considered seriously enough. Third, we propose theoretical performance and optimization models of PGAs on multi-core and many-core architectures, producing a more practical performance comparison of GAs on these architectures, so that the speedups presented in this work are more reasonable and a better guide to practical decisions.
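For readers unfamiliar with the island-model PGA structure the abstract refers to, here is a minimal, serial toy skeleton: islands evolve independently and periodically exchange their best individuals in a ring. The toy objective, operators, and parameters are assumptions for illustration; they are not the authors' multi-core or GPU implementation.

```python
# Toy island-model GA skeleton (serial simulation of islands with ring migration).

import random

def fitness(ind):
    return -sum(x * x for x in ind)            # maximize -||x||^2 (optimum at the origin)

def evolve(pop, mut=0.1):
    """One generation: binary tournament selection plus Gaussian mutation."""
    new = []
    for _ in pop:
        a, b = random.sample(pop, 2)
        parent = a if fitness(a) > fitness(b) else b
        new.append([x + random.gauss(0, mut) for x in parent])
    return new

def island_ga(n_islands=4, pop_size=20, dim=5, gens=50, migrate_every=10):
    islands = [[[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
               for _ in range(n_islands)]
    for g in range(gens):
        islands = [evolve(pop) for pop in islands]          # independent evolution per island
        if g % migrate_every == 0:                          # ring migration of best individuals
            bests = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                pop[random.randrange(pop_size)] = bests[(i - 1) % n_islands]
    return max((max(pop, key=fitness) for pop in islands), key=fitness)

if __name__ == "__main__":
    best = island_ga()
    print(round(fitness(best), 3))
```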

14.
The ultra-dense Heterogeneous Network (HetNet), which consists of macro-cells and pico-cells, has been recognized as a key technique for improving network performance. However, the increasing number of pico-cells also causes severe interference, including inter-cell and intra-cell interference. Interference management has therefore become an important issue in ultra-dense HetNets, and the traditional enhanced inter-cell interference coordination (eICIC) scheme is no longer suited to the high density of small cells. In this paper, a hybrid interference management method based on dynamic eICIC and coordinated multi-point transmission (CoMP) is proposed. First, a virtual cell is established based on the characteristics of ultra-dense HetNets. Then, a novel joint dynamic eICIC scheme combined with multi-user beamforming is deployed to eliminate inter-cell interference and significantly improve the throughput of the virtual cell without a sharp decrease in macro-cell throughput. Furthermore, a virtual-cell-based joint transmission scheme is deployed with a power control algorithm, which markedly increases the spectrum efficiency at the virtual cell edge. Simulation results verify that the proposed scheme achieves better spectrum efficiency at both the macro-cell and virtual cell edges, and that network throughput is also improved.

15.
JCE is a framework that provides confidentiality security services for Java programs. Based on an analysis of JCE's design principles and a comparison of three possible ways to implement JCE on top of a cryptographic card, this paper proposes implementing a CSP that is based on Microsoft CryptoAPI and conforms to the standard JCE interfaces, so that critical data in Java applications can be encrypted in hardware. The steps and key techniques for implementing this CSP are also discussed in detail.

16.
A 4-node axisymmetric element based on the combined hybrid variational principle
Based on the combined hybrid variational principle, this paper derives a four-node axisymmetric element. From the energy compatibility condition, an axisymmetric stress mode with eight parameters is derived. Its outstanding advantage is the broad adaptability of the discrete model to the computational setting (e.g., element distortion and material incompressibility). Numerical examples show that the displacements and stresses of this axisymmetric element are clearly superior to those of other axisymmetric elements.

17.
This paper presents a new neural network training scheme for pattern recognition applications. Our training technique is a hybrid scheme which involves, firstly, the use of the efficient BFGS optimisation method for locating minima of the total error function and, secondly, the use of genetic algorithms for finding a global minimum. This paper also describes experiments that compare the performance of our scheme with three other hybrid schemes of this kind when applied to challenging pattern recognition problems. The experiments show that our scheme gives better results than the others.
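A small sketch of a hybrid of this general kind: a simple genetic algorithm explores the error surface globally, and BFGS (via SciPy) refines the best candidate locally. The toy error surface, GA operators, and parameters are assumptions chosen for illustration and do not reproduce the authors' training scheme.

```python
# Hypothetical GA + BFGS hybrid on a toy multimodal "total error" surface.

import numpy as np
from scipy.optimize import minimize

def error(w):
    """Toy multimodal error surface standing in for a network's training loss."""
    return (w[0] ** 2 + w[1] ** 2) / 20.0 - np.cos(w[0]) * np.cos(w[1] / np.sqrt(2)) + 1.0

def ga_search(pop_size=30, gens=40, dim=2, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-10, 10, size=(pop_size, dim))
    for _ in range(gens):
        scores = np.apply_along_axis(error, 1, pop)
        parents = pop[np.argsort(scores)[: pop_size // 2]]            # truncation selection
        children = parents + rng.normal(0, 0.5, size=parents.shape)   # Gaussian mutation
        pop = np.vstack([parents, children])
    return min(pop, key=error)

if __name__ == "__main__":
    w0 = ga_search()                               # global exploration
    result = minimize(error, w0, method="BFGS")    # local refinement
    print(np.round(result.x, 4), round(float(result.fun), 6))
```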

18.
Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task-assignment protocol to resolve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for double-precision Cholesky and QR factorizations. Our approach demonstrates performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open-source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs.

19.
Although current rendering methods based on global illumination models can produce high-quality images, their enormous computational cost makes them unsuitable for applications with strict rendering-speed requirements, such as architectural walkthroughs and virtual reality. To address this, the concept of optically mapped virtual objects is introduced: a cluster system built from networked PCs creates reflection and refraction virtual objects in parallel, and the graphics acceleration hardware on each cluster node then renders these virtual objects just like real 3D objects, so that images with reflection/refraction effects can be rendered quickly. Experimental results show that the method exploits both CPU computing power and graphics hardware acceleration to achieve high-performance photorealistic rendering, and is especially suitable for applications that require interactive rendering, such as architectural walkthroughs, computer animation, and virtual reality.

20.
Multiple high-order time-integration schemes are used to solve stiff test problems related to the Navier-Stokes (NS) equations. The primary objective is to determine whether high-order schemes can displace the currently used second-order schemes on stiff NS and Reynolds-averaged NS (RANS) problems for a meaningful portion of the work-precision spectrum. Implicit-explicit (IMEX) schemes are used on separable problems that naturally partition into stiff and nonstiff components. Non-separable problems are solved with fully implicit schemes, oftentimes the implicit portion of an IMEX scheme. The convection-diffusion-reaction (CDR) equations allow a term-by-term stiff/nonstiff partition that is often well suited for IMEX methods. Major variables in CDR converge at near design-order rates with all formulations, including the fourth-order IMEX additive Runge-Kutta (ARK2) schemes that are susceptible to order reduction. The semi-implicit backward differentiation formulae and IMEX ARK2 schemes are of comparable efficiency. Laminar and turbulent aerodynamic applications require fully implicit schemes, as they are not profitably partitioned. All schemes achieve design-order convergence rates on the laminar problem. The fourth-order explicit singly diagonally implicit Runge-Kutta (ESDIRK4) scheme is more efficient than the popular second-order backward differentiation formulae (BDF2) method. The BDF2 and fourth-order modified extended backward differentiation formulae (MEBDF4) schemes are of comparable efficiency on the turbulent problem; high precision requirements (greater than three significant digits) slightly favor the MEBDF4 scheme. Significant order reduction plagues the ESDIRK4 scheme in the turbulent case, and the magnitude of the order reduction varies with Reynolds number. The poor performance of the high-order methods can partially be attributed to poor solver performance: the huge time steps allowed by high-order formulations challenge the capabilities of algebraic solver technology.
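To illustrate the IMEX splitting idea mentioned above in its simplest form, here is a first-order IMEX Euler step on a contrived scalar problem with a stiff linear part treated implicitly and a nonstiff forcing treated explicitly. It is only a conceptual toy under assumed parameters, not one of the paper's high-order ARK or BDF schemes.

```python
# First-order IMEX Euler on u' = -lam_stiff*u + sin(t): stiff term implicit, forcing explicit.

import numpy as np

def imex_euler(u0, lam_stiff, t_end, dt):
    """u_{n+1} = (u_n + dt*sin(t_n)) / (1 + dt*lam_stiff)."""
    u, t = u0, 0.0
    while t < t_end - 1e-12:
        f_explicit = np.sin(t)                               # nonstiff part, explicit
        u = (u + dt * f_explicit) / (1.0 + dt * lam_stiff)   # stiff part, implicit
        t += dt
    return u

if __name__ == "__main__":
    # remains stable with dt >> 1/lam_stiff, where a fully explicit Euler step would blow up
    print(imex_euler(u0=1.0, lam_stiff=1e4, t_end=1.0, dt=0.01))
```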
