Similar Literature
20 similar records found.
1.
Parallelization of the Kalman filter algorithm, with emphasis on the specific demands of multicore architecture implementation, is investigated. The approach is based on the nonrestrictive assumption of a banded system matrix; both time-varying and time-invariant systems can generally be transformed to such a form. The proposed method is applied to a radio interference power estimation problem, for which speedup evaluations using up to eight cores are performed. It is shown that the algorithm is capable of achieving linear speedup in the number of cores used, while speedup factors for a parallel BLAS implementation are less than two. An algorithm analysis that provides guidelines for the choice of implementation hardware to meet a desired performance is also provided.
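As a point of reference only (not the paper's implementation), the sketch below shows a single Kalman predict/update step in which the state-transition matrix A is banded; that structure is what allows the row blocks of the predict step to be split across cores. All dimensions and matrices here are invented for illustration.

```python
# Minimal sketch (illustrative, not the paper's algorithm): one predict/update
# step of a Kalman filter whose state-transition matrix A is banded, so the
# products A @ x and A @ P can be partitioned into independent row blocks.
import numpy as np

def kalman_step(x, P, A, H, Q, R, z):
    """One predict/update step; A is assumed banded (bandwidth << n)."""
    # Predict: with a banded A, each row of A @ x touches only O(bandwidth)
    # entries, so row blocks can be assigned to different cores.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Toy example with a tridiagonal (bandwidth-1) system matrix.
n = 6
A = np.eye(n) + 0.1 * np.eye(n, k=1) + 0.1 * np.eye(n, k=-1)
H = np.eye(n)
x, P = np.zeros(n), np.eye(n)
x, P = kalman_step(x, P, A, H, 0.01 * np.eye(n), 0.1 * np.eye(n), np.ones(n))
```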

2.
3.
Vernon Rego. Acta Informatica, 1992, 29(6-7): 579-594.
A set of sufficient conditions is obtained for Markov chains to yield upper and lower passage time bounds. While obtaining expected passage times is strictly a numerical procedure for general Markov chains, the results presented here outline a simple approach to bounding expected passage times provided the chains satisfy certain easy-to-check criteria. The results may be useful in modelling situations, such as in the analysis of algorithms, where simple ways of obtaining average complexity estimates are required. Research performed at the Mathematical Sciences Section of Oak Ridge National Laboratory under the auspices of the Faculty Research Participation Program of Oak Ridge Associated Universities, and supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. DOE, under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc. Also supported in part by the National Science Foundation under award ASC-9002225 and NATO award CRG900108.
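For a concrete, purely illustrative baseline, expected passage times into a target state can be computed exactly by solving a linear system; the bounds discussed in the abstract aim to bracket such values without this full numerical solve. The chain below is a toy example of my own.

```python
# Exact expected passage (hitting) times into a target state, computed
# numerically as t = (I - Q)^{-1} * 1, where Q restricts the transition matrix
# to the non-target states. Illustrative chain, not from the paper.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])   # state 2 is the target (absorbing here)

Q = P[:2, :2]                     # transitions among non-target states
t = np.linalg.solve(np.eye(2) - Q, np.ones(2))
print(t)                          # expected steps to reach state 2 from states 0 and 1
```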

4.
The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this work, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3-4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.
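A schematic of the particle-binning idea mentioned above, with invented data and no relation to the actual GTC kernels: particles are sorted by grid cell so that charge deposition becomes a locality-friendly scatter-add, the serial analogue of the atomic floating-point updates used on GPUs.

```python
# Hypothetical 1D charge-deposition sketch (not GTC code): bin particles by
# grid cell to improve locality, then accumulate charge with a scatter-add.
import numpy as np

def deposit_charge(positions, charges, n_cells, length):
    cell = np.minimum((positions / length * n_cells).astype(int), n_cells - 1)
    order = np.argsort(cell)                  # bin particles by cell
    cell, charges = cell[order], charges[order]
    grid = np.zeros(n_cells)
    np.add.at(grid, cell, charges)            # scatter-add (atomic-add analogue)
    return grid

rng = np.random.default_rng(0)
grid = deposit_charge(rng.uniform(0.0, 1.0, 10000), np.ones(10000), 64, 1.0)
```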

5.
6.
The availability of multicore processors and programmable NICs, such as TOEs (TCP/IP Offloading Engines), provides new opportunities for designing efficient network interfaces to cope with the gap between the improvement rates of link bandwidths and microprocessor performance. This gap poses important challenges related to the high computational requirements associated with the traffic volumes and the wider functionality that the network interface has to support. In this way, taking into account the rate of link bandwidth improvement and the ever-changing and increasing application demands, efficient network interface architectures require scalability and flexibility. An opportunity to reach these goals comes from exploiting the parallelism in the communication path by distributing the protocol processing work across the processors available in the computer, i.e. multicore microprocessors and programmable NICs. Thus, after a brief review of the different solutions that have been previously proposed for speeding up network interfaces, this paper analyzes the onloading and offloading alternatives. Both strategies try to release host CPU cycles by executing the communication workload on other processors present in the node. Nevertheless, whereas onloading uses another general-purpose processor, either included in a chip multiprocessor (CMP) or in a symmetric multiprocessor (SMP), offloading takes advantage of processors in programmable network interface cards (NICs). From our experiments, implemented using a full-system simulator, we provide a fair and more complete comparison between onloading and offloading. It is shown that the relative improvement in peak throughput offered by offloading and onloading depends on the ratio of application workload to communication overhead, the message sizes, and the characteristics of the system architecture, more specifically the bandwidth of the buses and the way the NIC is connected to the system processor and memory. In our implementations, offloading provides lower latencies than onloading, although CPU utilization and interrupts are lower for onloading. Taking into account the conclusions of our experimental results, we propose a hybrid network interface that can take advantage of both programmable NICs and multicore processors.

7.
To address the high cost of executing star joins between the fact table and multiple dimension tables in online analytical processing (OLAP), a star-join optimization method for modern multicore central processing units (CPUs) and graphics processing units (GPUs) is proposed. First, to reduce the materialization cost of star joins on multicore CPU and GPU platforms, a vectorized star-join algorithm based on vector indexes is proposed for both platforms; then, through CPU cache-oriented … [abstract truncated]
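A minimal sketch of what a vector-index star join can look like (my own toy tables and column names, not the paper's algorithm): each dimension predicate is materialized as a filter vector addressed by surrogate key, and the fact table's foreign keys gather into those vectors instead of probing hash tables.

```python
# Vectorized star-join sketch (hypothetical data): dimension predicates become
# 0/1 filter vectors; the fact table's foreign keys index into them directly.
import numpy as np

# Dimension filter vectors: 1 if the dimension row passes its predicate.
dim_date_ok = np.array([0, 1, 1, 0, 1], dtype=np.int8)
dim_cust_ok = np.array([1, 0, 1, 1], dtype=np.int8)

# Fact table foreign keys and a measure column.
fk_date = np.array([0, 1, 2, 4, 3, 2])
fk_cust = np.array([0, 2, 1, 3, 0, 2])
sales   = np.array([10, 20, 30, 40, 50, 60])

mask = dim_date_ok[fk_date] & dim_cust_ok[fk_cust]   # star join as vector gathers
print(sales[mask.astype(bool)].sum())                # aggregate over qualifying rows
```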

8.
We study a linear stochastic approximation algorithm that arises in the context of reinforcement learning. The algorithm employs a decreasing step-size, and is driven by Markov noise with time-varying statistics. We show that under suitable conditions, the algorithm can track the changes in the statistics of the Markov noise, as long as these changes are slower than the rate at which the step-size of the algorithm goes to zero.
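A toy illustration of such an iterate (my own example, not the paper's setting): a scalar linear stochastic approximation x_{n+1} = x_n + a_n (b(Y_n) - A(Y_n) x_n) with step-size a_n = 1/n, driven by a two-state Markov chain whose transition probabilities drift slowly.

```python
# Linear stochastic approximation driven by slowly time-varying Markov noise;
# constants and the drift law are invented for illustration only.
import numpy as np

rng = np.random.default_rng(1)
x, y = 0.0, 0
for n in range(1, 20000):
    p_stay = 0.9 + 0.05 * np.sin(1e-4 * n)        # slowly time-varying statistics
    y = y if rng.random() < p_stay else 1 - y     # Markov noise
    A_y, b_y = (1.0, 1.0) if y == 0 else (2.0, 4.0)
    a_n = 1.0 / n                                  # decreasing step-size
    x += a_n * (b_y - A_y * x)
print(x)   # tracks the slowly moving averaged equilibrium
```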

9.
In deeply embedded heterogeneous multicores the allocation of data to memories is crucial for application performance. For applications with stringent throughput constraints, the allocation is often done manually by carefully assigning static memory locations to the logical buffers of the application. Today, designers are confronted with applications with thousands of buffers and architectures with hundreds of memories, rendering manual approaches impractical. In this paper we present an automatic approach for statically allocating logical buffers to physical memories, assuming a fixed task-to-processor mapping and respecting multiple throughput constraints. In our approach, we model the application in a data-centric way, by explicitly defining buffers and associating computational tasks that access the buffers within well-specified time intervals. Besides, we use an architecture model that allows us to perform an allocation that is aware of the topology of the multicore and the physical bandwidth constraints of the interconnect. We present a layered approach to describe and solve the buffer-allocation problem as well as related subproblems, using mixed-integer linear programming. We show that the buffer-allocation problem is NP-complete, and present a more scalable formulation as a semi-definite programming problem. We evaluate the proposed LP methods by allocating around 1000 buffers, corresponding to processing one frame in the Long-Term Evolution (LTE) standard, onto a multicore with 80 processing elements. We introduce a solution approach that allowed us to find an optimal allocation in around 2 hours, which is at least two orders of magnitude faster than a straightforward formulation.
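One generic way such a buffer-allocation ILP could be written (the notation and the particular constraints are my assumptions, not the paper's formulation): binary variables assign each buffer to exactly one memory, subject to capacity and interconnect-bandwidth budgets.

```latex
% Sketch of a generic buffer-allocation ILP (notation mine, not the paper's):
% x_{bm} = 1 iff buffer b is placed in memory m, s_b = size of buffer b,
% C_m = capacity of memory m, r_{bmt} = bandwidth buffer b induces on link t
% when placed in m, B_t = bandwidth budget of link t, c_{bm} = access cost.
\begin{align}
  \min_{x} \;& \sum_{b,m} c_{bm}\, x_{bm} \\
  \text{s.t.}\;& \sum_{m} x_{bm} = 1 && \forall b \quad \text{(each buffer in exactly one memory)} \\
  & \sum_{b} s_b\, x_{bm} \le C_m && \forall m \quad \text{(memory capacity)} \\
  & \sum_{b,m} r_{bmt}\, x_{bm} \le B_t && \forall t \quad \text{(interconnect bandwidth / throughput)} \\
  & x_{bm} \in \{0,1\}
\end{align}
```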

10.
We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the high-performance computing (HPC) literature, including the Intel Xeon E5345 (Clovertown), AMD Opteron 2214 (Santa Rosa), AMD Opteron 2356 (Barcelona), Sun T5140 T2+ (Victoria Falls), as well as a QS20 IBM Cell Blade. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform, while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 15 times improvement compared with the original code at a given concurrency. Additionally, we present a detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.
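A stripped-down illustration of search-based auto-tuning (not the authors' code generator): time a parameterized stand-in kernel over a small space of candidate blocking factors and keep the fastest variant for the machine at hand.

```python
# Search-based auto-tuning sketch: the "kernel" and its block-size parameter
# space are placeholders, not the LBMHD code or its generated variants.
import time
import numpy as np

def kernel(a, block):
    # stand-in for a generated kernel variant parameterized by block size
    out = np.empty_like(a)
    for i in range(0, a.size, block):
        out[i:i + block] = np.sqrt(a[i:i + block]) * 2.0
    return out

a = np.random.rand(1 << 20)
best = None
for block in (256, 1024, 4096, 16384):
    t0 = time.perf_counter()
    kernel(a, block)
    dt = time.perf_counter() - t0
    if best is None or dt < best[1]:
        best = (block, dt)
print("best block size for this machine:", best[0])
```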

11.
This paper shows that some of the recently obtained product form results for stochastic Petri nets can be obtained as a special case of a simple exclusion mechanism for the product process of a collection of Markov chains.

12.
Multicore clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issues exposed by deep memory hierarchies by carefully considering the mapping between the collective topology and the hardware topologies, as well as the use of single-copy kernel-assisted mechanisms. However, in distributed environments, a single-level approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in the capability to support duplex communications or operate multiple concurrent copies. This calls for a collaborative approach between multiple layers of collective algorithms, dedicated to extracting the maximum degree of parallelism from the collective algorithm by consolidating the intra- and inter-node communications.
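A sketch of the two-level idea, assuming mpi4py and an MPI launch (e.g. via mpirun); the communicator names and the node-grouping step are illustrative assumptions, not the paper's implementation: reduce within each node first, where shared memory and kernel-assisted copies apply, then reduce across node leaders over the network.

```python
# Hierarchical (intra-node, then inter-node) reduction sketch with mpi4py.
from mpi4py import MPI

world = MPI.COMM_WORLD
# Group ranks sharing a node: intra-node traffic can use shared memory,
# inter-node traffic goes over the network between node leaders.
node = world.Split_type(MPI.COMM_TYPE_SHARED)
leaders = world.Split(0 if node.rank == 0 else MPI.UNDEFINED, world.rank)

local = float(world.rank)
node_sum = node.reduce(local, op=MPI.SUM, root=0)      # level 1: within the node
if node.rank == 0:
    total = leaders.allreduce(node_sum, op=MPI.SUM)    # level 2: across nodes
else:
    total = None
total = node.bcast(total, root=0)                      # redistribute within node
```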

13.
The three-dimensional wavelet transform (3D-DWT) has attracted the attention of the research community, above all in areas such as video watermarking, compression of volumetric medical data, multispectral image coding, 3D model coding, and video coding. In this work, we present several strategies to speed up the 3D-DWT computation through multicore processing. An in-depth analysis of the available compiler optimizations is also presented. Depending on both the multicore platform and the GOP size, the developed parallel algorithm obtains efficiencies above 95% using up to four cores (or processes), and above 83% using up to 12 cores. Furthermore, the extra memory requirement is under 0.12% for low-resolution video frames, and under 0.017% for high-resolution video frames. We also present a CUDA-based algorithm that computes the 3D-DWT using shared memory for the extra memory demands, obtaining speed-ups of up to 12.68 on the many-core GTX280 platform. In areas such as video processing or ultra-high-definition image processing, memory requirements can significantly degrade the developed algorithms; however, our algorithm increases the memory requirements by only a negligible percentage and is able to perform a nearly in-place computation of the 3D-DWT, whereas other state-of-the-art 3D-DWT algorithms commonly use a separate memory space to store the computed wavelet coefficients, thereby doubling the memory requirements.
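A compact, purely illustrative view of the separable 3D-DWT (one Haar level applied along each axis, my own code, not the paper's): the per-axis passes operate on independent slices, which is the parallelism the multicore and CUDA versions exploit.

```python
# Separable 3D-DWT sketch: a single Haar decomposition level per axis.
import numpy as np

def haar_1d(x, axis):
    even = np.take(x, range(0, x.shape[axis], 2), axis=axis)
    odd  = np.take(x, range(1, x.shape[axis], 2), axis=axis)
    approx, detail = (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)
    return np.concatenate([approx, detail], axis=axis)

def dwt3d_one_level(volume):
    out = volume.astype(float)
    for axis in range(3):      # slices along the other two axes are independent
        out = haar_1d(out, axis)
    return out

vol = np.random.rand(8, 8, 8)   # even-sized toy volume
coeffs = dwt3d_one_level(vol)
```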

14.
The potential computational power of today's multicore processors has drastically improved compared to the single-processor architecture. Since the trend of increasing the processor frequency is almost over, the competition for increased performance has moved to the number of cores. Consequently, the fundamental features of system designs and their associated design flows and tools need to change in order to support scalable parallelism and design portability. The same feature can be exploited to design reconfigurable hardware, such as FPGAs, which leads to rethinking the mapping of sequential algorithms to HDL. The sequential programming paradigm, widely used for programming single-processor systems, does not naturally provide explicit or implicit forms of scalable parallelism. Conversely, dataflow programming is an approach that naturally provides parallelism and the potential to unify SW and HDL designs on heterogeneous platforms. This study describes a dataflow-based design methodology aiming at a unified co-design and co-synthesis of heterogeneous systems. Experimental results on the implementation of a JPEG codec and an MPEG-4 SP decoder on heterogeneous platforms demonstrate the flexibility and capabilities of this design approach.

15.
We consider the following decision problem: given a finite Markov chain with distinguished source and target states, and given a rational number r, does there exist an integer n such that the probability to reach the target from the source in n steps is r? This problem, which is not known to be decidable, lies at the heart of many model checking questions on Markov chains. We provide evidence of the hardness of the problem by giving a reduction from the Skolem Problem: a number-theoretic decision problem whose decidability has been open for many decades.
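A concrete instance of the decision problem, with made-up numbers: compute the exact n-step reachability probability as the (source, target) entry of P^n in rational arithmetic and compare it against r for successive n. The open question is whether such a search can be bounded in general.

```python
# n-step reachability probability with exact rational arithmetic; the chain,
# source/target states, and r are illustrative values.
from fractions import Fraction as F

P = [[F(1, 2), F(1, 2)],
     [F(1, 3), F(2, 3)]]
source, target, r = 0, 1, F(1, 2)

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

M = [[F(1), F(0)], [F(0), F(1)]]        # identity: 0-step transition probabilities
for n in range(1, 50):
    M = matmul(M, P)                    # exact n-step transition probabilities
    if M[source][target] == r:
        print("probability equals r after n =", n, "steps")
        break
```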

16.
The probability distribution of a Markov chain is viewed as the information state of an additive optimization problem. This optimization problem is then generalized to a product form whose information state gives rise to a generalized notion of probability distribution for Markov chains. The evolution and the asymptotic behavior of this generalized or “risk-sensitive” probability distribution are studied in this paper, and a conjecture is proposed regarding the asymptotic periodicity of the risk-sensitive probability and proved in the two-dimensional case. The relation between a set of simultaneous non-linear equations and the set of periodic attractors is analyzed.

17.
Multicore accelerators are used today to supplement traditional superscalar processors in massively parallel computer nodes with extra floating-point computation power. This paper presents our parallelization, performance enhancement, and evaluation of the conjugate gradient (CG) linear equation solver with enhanced matrix multiplication on the Cell Broadband Engine accelerator. The paper also compares the CG performance results on the Cell with two CG implementations on a computer with two quadcore Xeon processors, one with OpenMP and the other with OpenMPI. We also report the enhancements made to the CG code and a performance analysis of CG on single and dual Cell Broadband Engine packages with 8 and 16 synergistic processing elements and on Xeon for heptadiagonal matrices, with particular attention to matrix multiplication and synchronization. We also report the communication and computation time breakdowns and the floating-point operations per second ratio. Our parallel CG solver is shown to scale well with data size, grid dimensionality, and number of cores. Copyright © 2011 John Wiley & Sons, Ltd.
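A plain, unoptimized conjugate-gradient baseline (illustrative only, not the Cell or Xeon implementations from the paper), showing why the matrix-vector product dominates the cost and is the natural target for parallelization; the tridiagonal test matrix here is invented.

```python
# Reference CG solver for a symmetric positive-definite system.
import numpy as np

def cg(A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                       # dominant cost: matrix-vector product
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

n = 100
A = np.diag(np.full(n, 4.0)) + np.diag(np.full(n - 1, -1.0), 1) + np.diag(np.full(n - 1, -1.0), -1)
x = cg(A, np.ones(n))
```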

18.
A Markov chain is controlled by a decision maker receiving his observations of the state via a noisy memoryless channel. That information is encoded causally. The encoder is assumed to have perfect channel feedback information. Separation results are derived and used to prove that encoding is useless for a class of symmetric channels. This paper extends the results of the authors (1983) by using methods similar to those of that paper.

19.
20.
Freescale Semiconductor has introduced the QorIQ P4080 multicore processor, a highly advanced eight-core communications processor designed to set new standards for performance, power efficiency, and programmability in the embedded multicore space. The P4080 multicore processor, a flagship member of Freescale's new QorIQ platform, is built on 45 nm process technology. It integrates enhanced Power Architecture™ cores, a three-level cache hierarchy, the innovative CoreNet™ on-chip fabric, and datapath acceleration, and within a maximum 30 W power envelope it can deliver … [abstract truncated]
