Similar Literature
20 similar records found.
1.
Parallelization of the Kalman filter algorithm, with emphasis on the specific demands of multicore architecture implementation, is investigated. The approach is based on the nonrestrictive assumption of a banded system matrix; both time-varying and time-invariant systems can generally be transformed to such a form. The proposed method is applied to a radio interference power estimation problem, for which speedup evaluations using up to eight cores are performed. It is shown that the algorithm is capable of achieving linear speedup in the number of cores used, while speedup factors for a parallel BLAS implementation are less than two. An algorithm analysis that provides guidelines for the choice of implementation hardware to meet a desired performance is also provided.
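As a point of reference only (not the paper's implementation), the sketch below shows a single Kalman predict/update step in which the state-transition matrix A is banded; that structure is what allows the row blocks of the predict step to be split across cores. All dimensions and matrices here are invented for illustration.

```python
# Minimal sketch (illustrative, not the paper's algorithm): one predict/update
# step of a Kalman filter whose state-transition matrix A is banded, so the
# products A @ x and A @ P can be partitioned into independent row blocks.
import numpy as np

def kalman_step(x, P, A, H, Q, R, z):
    """One predict/update step; A is assumed banded (bandwidth << n)."""
    # Predict: with a banded A, each row of A @ x touches only O(bandwidth)
    # entries, so row blocks can be assigned to different cores.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Toy example with a tridiagonal (bandwidth-1) system matrix.
n = 6
A = np.eye(n) + 0.1 * np.eye(n, k=1) + 0.1 * np.eye(n, k=-1)
H = np.eye(n)
x, P = np.zeros(n), np.eye(n)
x, P = kalman_step(x, P, A, H, 0.01 * np.eye(n), 0.1 * np.eye(n), np.ones(n))
```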

2.
3.
Vernon Rego. Acta Informatica, 1992, 29(6-7): 579-594.
A set of sufficient conditions is obtained for Markov chains to yield upper and lower passage time bounds. While obtaining expected passage times is strictly a numerical procedure for general Markov chains, the results presented here outline a simple approach to bounding expected passage times provided the chains satisfy certain easy-to-check criteria. The results may be useful in modelling situations, such as in the analysis of algorithms, where simple ways of obtaining average complexity estimates are required. Research performed at the Mathematical Sciences Section of Oak Ridge National Laboratory under the auspices of the Faculty Research Participation Program of Oak Ridge Associated Universities, and supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. DOE, under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc. Also supported in part by the National Science Foundation under award ASC-9002225 and NATO award CRG900108.
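For a concrete, purely illustrative baseline, expected passage times into a target state can be computed exactly by solving a linear system; the bounds discussed in the abstract aim to bracket such values without this full numerical solve. The chain below is a toy example of my own.

```python
# Exact expected passage (hitting) times into a target state, computed
# numerically as t = (I - Q)^{-1} * 1, where Q restricts the transition matrix
# to the non-target states. Illustrative chain, not from the paper.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])   # state 2 is the target (absorbing here)

Q = P[:2, :2]                     # transitions among non-target states
t = np.linalg.solve(np.eye(2) - Q, np.ones(2))
print(t)                          # expected steps to reach state 2 from states 0 and 1
```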

4.
The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this work, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3-4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.
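A schematic of the particle-binning idea mentioned above, with invented data and no relation to the actual GTC kernels: particles are sorted by grid cell so that charge deposition becomes a locality-friendly scatter-add, the serial analogue of the atomic floating-point updates used on GPUs.

```python
# Hypothetical 1D charge-deposition sketch (not GTC code): bin particles by
# grid cell to improve locality, then accumulate charge with a scatter-add.
import numpy as np

def deposit_charge(positions, charges, n_cells, length):
    cell = np.minimum((positions / length * n_cells).astype(int), n_cells - 1)
    order = np.argsort(cell)                  # bin particles by cell
    cell, charges = cell[order], charges[order]
    grid = np.zeros(n_cells)
    np.add.at(grid, cell, charges)            # scatter-add (atomic-add analogue)
    return grid

rng = np.random.default_rng(0)
grid = deposit_charge(rng.uniform(0.0, 1.0, 10000), np.ones(10000), 64, 1.0)
```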

5.
6.
The availability of multicore processors and programmable NICs, such as TOEs (TCP/IP Offloading Engines), provides new opportunities for designing efficient network interfaces to cope with the gap between the improvement rates of link bandwidths and microprocessor performance. This gap poses important challenges related to the high computational requirements associated with the traffic volumes and the wider functionality that the network interface has to support. In this way, taking into account the rate of link bandwidth improvement and the ever-changing and increasing application demands, efficient network interface architectures require scalability and flexibility. An opportunity to reach these goals comes from exploiting the parallelism in the communication path by distributing the protocol processing work across the processors available in the computer, i.e. multicore microprocessors and programmable NICs. Thus, after a brief review of the different solutions that have been previously proposed for speeding up network interfaces, this paper analyzes the onloading and offloading alternatives. Both strategies try to release host CPU cycles by executing the communication workload on other processors present in the node. Nevertheless, whereas onloading uses another general-purpose processor, either included in a chip multiprocessor (CMP) or in a symmetric multiprocessor (SMP), offloading takes advantage of processors in programmable network interface cards (NICs). From our experiments, implemented using a full-system simulator, we provide a fair and more complete comparison between onloading and offloading. It is shown that the relative improvement in peak throughput offered by offloading and onloading depends on the ratio of application workload to communication overhead, the message sizes, and the characteristics of the system architecture, more specifically the bandwidth of the buses and the way the NIC is connected to the system processor and memory. In our implementations, offloading provides lower latencies than onloading, although CPU utilization and interrupts are lower for onloading. Taking into account the conclusions of our experimental results, we propose a hybrid network interface that can take advantage of both programmable NICs and multicore processors.

7.
To address the high cost of executing star joins between the fact table and multiple dimension tables in online analytical processing (OLAP), a star-join optimization method for modern multicore central processing units (CPUs) and graphics processing units (GPUs) is proposed. First, to reduce the materialization cost of star joins on multicore CPU and GPU platforms, a vectorized star-join algorithm based on vector indexes is proposed for both platforms; then, through CPU cache-oriented … [abstract truncated]
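A minimal sketch of what a vector-index star join can look like (my own toy tables and column names, not the paper's algorithm): each dimension predicate is materialized as a filter vector addressed by surrogate key, and the fact table's foreign keys gather into those vectors instead of probing hash tables.

```python
# Vectorized star-join sketch (hypothetical data): dimension predicates become
# 0/1 filter vectors; the fact table's foreign keys index into them directly.
import numpy as np

# Dimension filter vectors: 1 if the dimension row passes its predicate.
dim_date_ok = np.array([0, 1, 1, 0, 1], dtype=np.int8)
dim_cust_ok = np.array([1, 0, 1, 1], dtype=np.int8)

# Fact table foreign keys and a measure column.
fk_date = np.array([0, 1, 2, 4, 3, 2])
fk_cust = np.array([0, 2, 1, 3, 0, 2])
sales   = np.array([10, 20, 30, 40, 50, 60])

mask = dim_date_ok[fk_date] & dim_cust_ok[fk_cust]   # star join as vector gathers
print(sales[mask.astype(bool)].sum())                # aggregate over qualifying rows
```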

8.
We study a linear stochastic approximation algorithm that arises in the context of reinforcement learning. The algorithm employs a decreasing step-size, and is driven by Markov noise with time-varying statistics. We show that under suitable conditions, the algorithm can track the changes in the statistics of the Markov noise, as long as these changes are slower than the rate at which the step-size of the algorithm goes to zero.
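A toy illustration of such an iterate (my own example, not the paper's setting): a scalar linear stochastic approximation x_{n+1} = x_n + a_n (b(Y_n) - A(Y_n) x_n) with step-size a_n = 1/n, driven by a two-state Markov chain whose transition probabilities drift slowly.

```python
# Linear stochastic approximation driven by slowly time-varying Markov noise;
# constants and the drift law are invented for illustration only.
import numpy as np

rng = np.random.default_rng(1)
x, y = 0.0, 0
for n in range(1, 20000):
    p_stay = 0.9 + 0.05 * np.sin(1e-4 * n)        # slowly time-varying statistics
    y = y if rng.random() < p_stay else 1 - y     # Markov noise
    A_y, b_y = (1.0, 1.0) if y == 0 else (2.0, 4.0)
    a_n = 1.0 / n                                  # decreasing step-size
    x += a_n * (b_y - A_y * x)
print(x)   # tracks the slowly moving averaged equilibrium
```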

9.
In deeply embedded heterogeneous multicores the allocation of data to memories is crucial for application performance. For applications with stringent throughput constraints, the allocation is often done manually by carefully assigning static memory locations to the logical buffers of the application. Today, designers are confronted with applications with thousands of buffers and architectures with hundreds of memories, rendering manual approaches impractical. In this paper we present an automatic approach for statically allocating logical buffers to physical memories, assuming a fixed task-to-processor mapping and respecting multiple throughput constraints. In our approach, we model the application in a data-centric way, by explicitly defining buffers and associating computational tasks that access the buffers within well-specified time intervals. Besides, we use an architecture model that allows us to perform an allocation that is aware of the topology of the multicore and the physical bandwidth constraints of the interconnect. We present a layered approach to describe and solve the buffer-allocation problem as well as related subproblems, using mixed-integer linear programming. We show that the buffer-allocation problem is NP-complete, and present a more scalable formulation as a semi-definite programming problem. We evaluate the proposed LP methods by allocating around 1000 buffers, corresponding to processing one frame in the Long-Term Evolution (LTE) standard, onto a multicore with 80 processing elements. We introduce a solution approach that allowed us to find an optimal allocation in around 2 hours, which is at least two orders of magnitude faster than a straightforward formulation.
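One generic way such a buffer-allocation ILP could be written (the notation and the particular constraints are my assumptions, not the paper's formulation): binary variables assign each buffer to exactly one memory, subject to capacity and interconnect-bandwidth budgets.

```latex
% Sketch of a generic buffer-allocation ILP (notation mine, not the paper's):
% x_{bm} = 1 iff buffer b is placed in memory m, s_b = size of buffer b,
% C_m = capacity of memory m, r_{bmt} = bandwidth buffer b induces on link t
% when placed in m, B_t = bandwidth budget of link t, c_{bm} = access cost.
\begin{align}
  \min_{x} \;& \sum_{b,m} c_{bm}\, x_{bm} \\
  \text{s.t.}\;& \sum_{m} x_{bm} = 1 && \forall b \quad \text{(each buffer in exactly one memory)} \\
  & \sum_{b} s_b\, x_{bm} \le C_m && \forall m \quad \text{(memory capacity)} \\
  & \sum_{b,m} r_{bmt}\, x_{bm} \le B_t && \forall t \quad \text{(interconnect bandwidth / throughput)} \\
  & x_{bm} \in \{0,1\}
\end{align}
```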

10.
We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the high-performance computing (HPC) literature, including the Intel Xeon E5345 (Clovertown), AMD Opteron 2214 (Santa Rosa), AMD Opteron 2356 (Barcelona), Sun T5140 T2+ (Victoria Falls), as well as a QS20 IBM Cell Blade. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 15 times improvement compared with the original code at a given concurrency. Additionally, we present a detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.
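A stripped-down illustration of search-based auto-tuning (not the authors' code generator): time a parameterized stand-in kernel over a small space of candidate blocking factors and keep the fastest variant for the machine at hand.

```python
# Search-based auto-tuning sketch: the "kernel" and its block-size parameter
# space are placeholders, not the LBMHD code or its generated variants.
import time
import numpy as np

def kernel(a, block):
    # stand-in for a generated kernel variant parameterized by block size
    out = np.empty_like(a)
    for i in range(0, a.size, block):
        out[i:i + block] = np.sqrt(a[i:i + block]) * 2.0
    return out

a = np.random.rand(1 << 20)
best = None
for block in (256, 1024, 4096, 16384):
    t0 = time.perf_counter()
    kernel(a, block)
    dt = time.perf_counter() - t0
    if best is None or dt < best[1]:
        best = (block, dt)
print("best block size for this machine:", best[0])
```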

11.
This paper shows that some of the recently obtained product form results for stochastic Petri nets can be obtained as a special case of a simple exclusion mechanism for the product process of a collection of Markov chains.

12.
Multicore clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issues exposed by deep memory hierarchies by carefully considering the mapping between the collective topology and the hardware topologies, as well as the use of single-copy kernel-assisted mechanisms. However, in distributed environments, a single-level approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in the capability to support duplex communications or operate multiple concurrent copies. This calls for a collaborative approach between multiple layers of collective algorithms, dedicated to extracting the maximum degree of parallelism from the collective algorithm by consolidating the intra- and inter-node communications.
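A sketch of the two-level idea, assuming mpi4py and an MPI launch (e.g. via mpirun); the communicator names and the node-grouping step are illustrative assumptions, not the paper's implementation: reduce within each node first, where shared memory and kernel-assisted copies apply, then reduce across node leaders over the network.

```python
# Hierarchical (intra-node, then inter-node) reduction sketch with mpi4py.
from mpi4py import MPI

world = MPI.COMM_WORLD
# Group ranks sharing a node: intra-node traffic can use shared memory,
# inter-node traffic goes over the network between node leaders.
node = world.Split_type(MPI.COMM_TYPE_SHARED)
leaders = world.Split(0 if node.rank == 0 else MPI.UNDEFINED, world.rank)

local = float(world.rank)
node_sum = node.reduce(local, op=MPI.SUM, root=0)      # level 1: within the node
if node.rank == 0:
    total = leaders.allreduce(node_sum, op=MPI.SUM)    # level 2: across nodes
else:
    total = None
total = node.bcast(total, root=0)                      # redistribute within node
```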

13.
The three-dimensional wavelet transform (3D-DWT) has attracted the attention of the research community, above all in areas such as video watermarking, compression of volumetric medical data, multispectral image coding, 3D model coding, and video coding. In this work, we present several strategies to speed up the 3D-DWT computation through multicore processing. An in-depth analysis of the available compiler optimizations is also presented. Depending on both the multicore platform and the GOP size, the developed parallel algorithm obtains efficiencies above 95% using up to four cores (or processes), and above 83% using up to 12 cores. Furthermore, the extra memory requirement is under 0.12% for low-resolution video frames, and under 0.017% for high-resolution video frames. We also present a CUDA-based algorithm that computes the 3D-DWT using shared memory for the extra memory demands, obtaining speed-ups of up to 12.68 on the many-core GTX280 platform. In areas such as video processing or ultra-high-definition image processing, memory requirements can significantly degrade the developed algorithms; however, our algorithm increases the memory requirements by only a negligible percentage and is able to perform a nearly in-place computation of the 3D-DWT, whereas other state-of-the-art 3D-DWT algorithms commonly use a separate memory space to store the computed wavelet coefficients, thereby doubling the memory requirements.
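A compact, purely illustrative view of the separable 3D-DWT (one Haar level applied along each axis, my own code, not the paper's): the per-axis passes operate on independent slices, which is the parallelism the multicore and CUDA versions exploit.

```python
# Separable 3D-DWT sketch: a single Haar decomposition level per axis.
import numpy as np

def haar_1d(x, axis):
    even = np.take(x, range(0, x.shape[axis], 2), axis=axis)
    odd  = np.take(x, range(1, x.shape[axis], 2), axis=axis)
    approx, detail = (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)
    return np.concatenate([approx, detail], axis=axis)

def dwt3d_one_level(volume):
    out = volume.astype(float)
    for axis in range(3):      # slices along the other two axes are independent
        out = haar_1d(out, axis)
    return out

vol = np.random.rand(8, 8, 8)   # even-sized toy volume
coeffs = dwt3d_one_level(vol)
```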

14.
The potential computational power of today's multicore processors has drastically improved compared to the single-processor architecture. Since the trend of increasing the processor frequency is almost over, the competition for increased performance has moved to the number of cores. Consequently, the fundamental features of system designs and their associated design flows and tools need to change in order to support scalable parallelism and design portability. The same feature can be exploited to design reconfigurable hardware, such as FPGAs, which leads to rethinking the mapping of sequential algorithms to HDL. The sequential programming paradigm, widely used for programming single-processor systems, does not naturally provide explicit or implicit forms of scalable parallelism. Conversely, dataflow programming is an approach that naturally provides parallelism and the potential to unify SW and HDL designs on heterogeneous platforms. This study describes a dataflow-based design methodology aiming at a unified co-design and co-synthesis of heterogeneous systems. Experimental results on the implementation of a JPEG codec and an MPEG-4 SP decoder on heterogeneous platforms demonstrate the flexibility and capabilities of this design approach.

15.
We consider the following decision problem: given a finite Markov chain with distinguished source and target states, and given a rational number r, does there exist an integer n such that the probability to reach the target from the source in n steps is r? This problem, which is not known to be decidable, lies at the heart of many model checking questions on Markov chains. We provide evidence of the hardness of the problem by giving a reduction from the Skolem Problem: a number-theoretic decision problem whose decidability has been open for many decades.
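A concrete instance of the decision problem, with made-up numbers: compute the exact n-step reachability probability as the (source, target) entry of P^n in rational arithmetic and compare it against r for successive n. The open question is whether such a search can be bounded in general.

```python
# n-step reachability probability with exact rational arithmetic; the chain,
# source/target states, and r are illustrative values.
from fractions import Fraction as F

P = [[F(1, 2), F(1, 2)],
     [F(1, 3), F(2, 3)]]
source, target, r = 0, 1, F(1, 2)

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

M = [[F(1), F(0)], [F(0), F(1)]]        # identity: 0-step transition probabilities
for n in range(1, 50):
    M = matmul(M, P)                    # exact n-step transition probabilities
    if M[source][target] == r:
        print("probability equals r after n =", n, "steps")
        break
```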

16.
The probability distribution of a Markov chain is viewed as the information state of an additive optimization problem. This optimization problem is then generalized to a product form whose information state gives rise to a generalized notion of probability distribution for Markov chains. The evolution and the asymptotic behavior of this generalized or “risk-sensitive” probability distribution are studied in this paper, and a conjecture is proposed regarding the asymptotic periodicity of the risk-sensitive probability and proved in the two-dimensional case. The relation between a set of simultaneous non-linear equations and the set of periodic attractors is analyzed.

17.
Multicore accelerators are used today to supplement traditional superscalar processors in massively parallel computer nodes with extra floating-point computation power. This paper presents our parallelization, performance enhancement, and evaluation of the conjugate gradient (CG) linear equation solver with enhanced matrix multiplication on the Cell Broadband Engine accelerator. The paper also compares the CG performance results on the Cell with two CG implementations on a computer with two quadcore Xeon processors, one with OpenMP and the other with OpenMPI. We also report the enhancements made to the CG code and a performance analysis of CG on single and dual Cell Broadband Engine packages with 8 and 16 synergistic processing elements and on Xeon for heptadiagonal matrices, with particular attention to matrix multiplication and synchronization. We also report the communication and computation time breakdowns and the floating-point operations per second ratio. Our parallel CG solver is shown to scale well with data size, grid dimensionality, and number of cores. Copyright © 2011 John Wiley & Sons, Ltd.
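A plain, unoptimized conjugate-gradient baseline (illustrative only, not the Cell or Xeon implementations from the paper), showing why the matrix-vector product dominates the cost and is the natural target for parallelization; the tridiagonal test matrix here is invented.

```python
# Reference CG solver for a symmetric positive-definite system.
import numpy as np

def cg(A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                       # dominant cost: matrix-vector product
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

n = 100
A = np.diag(np.full(n, 4.0)) + np.diag(np.full(n - 1, -1.0), 1) + np.diag(np.full(n - 1, -1.0), -1)
x = cg(A, np.ones(n))
```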

18.
A Markov chain is controlled by a decision maker receiving his observations of the state via a noisy memoryless channel. That information is encoded causally. The encoder is assumed to have perfect channel feedback information. Separation results are derived and used to prove that encoding is useless for a class of symmetric channels. This paper extends the results of the authors (1983) by using methods similar to those of that paper.

19.
20.
Freescale Semiconductor has introduced the QorIQ P4080 multicore processor, a highly advanced eight-core communications processor designed to set new standards for performance, power efficiency, and programmability in the embedded multicore space. The P4080 multicore processor, a flagship member of Freescale's new QorIQ platform, is built on 45 nm process technology. It integrates enhanced Power Architecture™ cores, a three-level cache hierarchy, the innovative CoreNet™ on-chip fabric, and datapath acceleration, and within a maximum 30 W power envelope it can deliver … [abstract truncated]
