Similar Documents
Found 20 similar documents (search time: 0 ms)
1.
This paper proposes a novel Colored Petri Net (CPN) based dynamic scheduling scheme, which aims at scheduling real-time tasks on multiprocessor system-on-chip (MPSoC) platforms. Our CPN-based scheme addresses two key issues in task scheduling: dependence detection and task dispatching. We model inter-task dependences using CPN, including true dependences, output dependences, anti-dependences, and structural dependences. The dependences can be detected automatically during model execution. Additionally, the proposed model takes the checking of real-time constraints into consideration. We evaluated the scheduling scheme on a state-of-the-art FPGA-based multiprocessor hardware system and modeled the system behavior using CPN Tools. Simulations and state space analyses are conducted on the model. Experimental results demonstrate that our scheme achieves 98.9% of the ideal speedup on a real FPGA-based hardware prototype.

2.
With the advent of multicore processors, it has become imperative to write parallel programs in order to exploit the next generation of processors. This paper deals with skyline computation as a case study of parallelizing database operations on multicore architectures. First we parallelize three sequential skyline algorithms, BBS, SFS, and SSkyline, to see whether the design principles of sequential skyline computation also extend to parallel skyline computation. Then we develop a new parallel skyline algorithm, PSkyline, based on the divide-and-conquer strategy. Experimental results show that all the algorithms successfully utilize multiple cores to achieve a reasonable speedup. In particular, PSkyline achieves a speedup approximately proportional to the number of cores precisely when parallel computation is needed most.
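The divide-and-conquer idea behind PSkyline can be illustrated with a minimal sequential sketch (function names are hypothetical; in the parallel version the two recursive halves would run on separate cores before their partial skylines are merged):

```python
def dominates(p, q):
    """p dominates q if p is <= q in every dimension and < in at least one
    (assuming smaller values are preferred)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive skyline: keep exactly the points dominated by no other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

def merge_skylines(s1, s2):
    """Merge two partial skylines, discarding cross-dominated points."""
    return ([p for p in s1 if not any(dominates(q, p) for q in s2)] +
            [q for q in s2 if not any(dominates(p, q) for p in s1)])

def pskyline(points, threshold=4):
    """Divide-and-conquer skyline; each half is an independent subproblem."""
    if len(points) <= threshold:
        return skyline(points)
    mid = len(points) // 2
    return merge_skylines(pskyline(points[:mid], threshold),
                          pskyline(points[mid:], threshold))
```

The merge step only compares points across the two halves, which keeps the recombination cost low relative to the naive all-pairs scan.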

3.
Parallel nonlinear preconditioners for solving mildly nonlinear systems are proposed. These algorithms are based on the Fletcher–Reeves version of the nonlinear conjugate gradient method combined with a polynomial-type preconditioner built from block two-stage methods. The behavior of these algorithms is analyzed when incomplete LU factorizations are used to obtain the inner splittings of the block two-stage method. As an illustrative example we consider a nonlinear elliptic partial differential equation known as the Bratu problem. The reported experiments show the performance of the proposed algorithms on two multicore architectures.
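For reference, the Fletcher–Reeves update at the heart of the method can be sketched as follows; this is a toy unpreconditioned version with a fixed step size (the actual algorithm uses a line search and the block two-stage preconditioner described above):

```python
def fletcher_reeves(grad, x0, steps=50, alpha=0.1):
    """Fletcher-Reeves nonlinear CG: search direction d is the negated
    gradient plus beta times the previous direction, with
    beta = ||g_new||^2 / ||g_old||^2."""
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]
    for _ in range(steps):
        x = [xi + alpha * di for xi, di in zip(x, d)]       # fixed-step update
        g_new = grad(x)
        beta = (sum(gi * gi for gi in g_new) /
                max(sum(gi * gi for gi in g), 1e-30))        # Fletcher-Reeves beta
        d = [-gi + beta * di for gi, di in zip(g_new, d)]
        g = g_new
    return x
```

On the quadratic f(x, y) = x^2 + y^2 the iterates contract geometrically toward the minimizer at the origin.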

4.
As technology advances, power density (and hence temperature) increases rapidly, threatening system performance, reliability, and even safety. Developing a thermal management method that reduces thermal hotspots and distributes temperature uniformly has therefore become an important issue, and dynamic thermal management (DTM) has emerged as an effective technique to address it. In this paper, we propose a proactive thermal management scheme for the Criticore platform developed by our team, designed to keep the system from reaching high temperatures. The proposed approach schedules threads to prevent overheating with the aid of the thermal sensors and the Power Management Circuit (PMC) built into the Criticore. Furthermore, a novel thread migration scheme is presented to increase system reliability.

5.
As multicore systems continue to gain ground in the high‐performance computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features of these new processors. Fine‐grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data (referred to as ‘tiles’). These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out‐of‐order execution of the tasks that will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can be exploited only at the level of the BLAS operations, and with vendor implementations. Copyright © 2008 John Wiley & Sons, Ltd.

6.
Biological sequence comparison is one of the most important tasks in Bioinformatics. Owing to the fast growth of databases that contain biological information, sequence comparison represents an important challenge for high‐performance computing, especially when very long sequences are compared, i.e. the complete genome of several organisms. The Smith–Waterman (SW) algorithm is an exact method based on dynamic programming to quantify local similarity between sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP and ILP). Concurrently, there is a paradigm shift towards chip multiprocessors in computer architecture, which offer a huge amount of potential performance that can only be exploited efficiently if applications are effectively mapped and parallelized. In this work, we analyze how large‐scale biology sequence comparison takes advantage of the current and future multicore architectures. Our starting point is the performance analysis of the current multicore IBM Cell B.E. processor; we analyze two different SW implementations on the Cell B.E. Then, using simulation tools, we study the performance scalability when a many‐core architecture is used for performing long DNA sequence comparison. We investigate the efficient memory organization that delivers the maximum bandwidth with the minimum cost. Our results show that a heterogeneous architecture can be an efficient alternative to execute challenging bioinformatic workloads. Copyright © 2011 John Wiley & Sons, Ltd.
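The dynamic-programming recurrence underlying Smith–Waterman is compact; a plain sequential sketch (scoring parameters are illustrative) shows the structure that makes the algorithm parallel-friendly, since every cell depends only on its upper, left, and upper-left neighbors, so anti-diagonals can be computed concurrently:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score via dynamic programming.
    H[i][j] is the best local alignment ending at a[i-1], b[j-1];
    clamping at 0 is what makes the alignment local."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score,   # match/mismatch
                          H[i - 1][j] + gap,         # gap in b
                          H[i][j - 1] + gap)         # gap in a
            best = max(best, H[i][j])
    return best
```

Real implementations (including those for the Cell B.E.) vectorize and tile this matrix; the recurrence itself is unchanged.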

7.
The wavelet tree has become a very useful data structure for efficiently representing and querying large volumes of data in many different domains, from bioinformatics to geographic information systems. One drawback of wavelet trees is their construction time. In this paper, we introduce two algorithms that reduce the time complexity of wavelet tree construction by taking advantage of today's ubiquitous multicore machines. Our first algorithm constructs all the levels of the wavelet tree in parallel in O(n) time and \(O(n\lg \sigma + \sigma \lg n)\) bits of working space, where n is the size of the input sequence and \(\sigma \) is the size of the alphabet. Our second algorithm constructs the wavelet tree in a domain-decomposition fashion, using our first algorithm in each segment, reaching \(O(\lg n)\) time and \(O(n\lg \sigma + p\sigma \lg n/\lg \sigma )\) bits of extra space, where p is the number of available cores. Both algorithms are practical and report good speedup for large real datasets.
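A rough sketch of per-level construction for a pointerless wavelet tree over integer symbols; this sequential version is illustrative only, but since each level's bitmap depends only on the input (reordered by the higher bits), the levels can be built by independent workers, which is the basis of the first parallel algorithm:

```python
import math

def wavelet_levels(seq, sigma):
    """Concatenated per-level bitmaps of a pointerless wavelet tree.
    Level k stores, for every symbol, its k-th most significant bit, with
    symbols ordered by a stable partition on the higher bits."""
    nlevels = max(1, math.ceil(math.log2(sigma)))
    levels = []
    order = list(seq)
    for level in range(nlevels):
        shift = nlevels - level - 1
        levels.append([(s >> shift) & 1 for s in order])
        # Stable partition: zero-bit symbols first, preserving relative order,
        # which concatenates the children node contents for the next level.
        order = ([s for s in order if not ((s >> shift) & 1)] +
                 [s for s in order if (s >> shift) & 1])
    return levels
```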

8.
We propose an approach to estimate the power consumption of algorithms, as a function of frequency and number of cores, using only a very reduced set of real power measurements. In addition, we formulate a method to select the voltage–frequency scaling and concurrency throttling configurations that should be tested in order to obtain accurate estimations of the power dissipation. The power models and selection methodology are verified using two real scientific applications: the stencil-based 3D MPDATA algorithm and the conjugate gradient (CG) method for sparse linear systems. MPDATA is a crucial component of the EULAG model, which is widely used in weather forecast simulations. The CG algorithm is the keystone of the iterative solution of sparse symmetric positive definite linear systems via Krylov subspace methods. The reliability of the method is confirmed on a variety of ARM and Intel architectures, where the estimated results match the real measured values with an average error slightly below 5% in all cases.
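The flavor of such estimation can be sketched with a simple bilinear power model fitted by least squares from a handful of measurements. The model form and helper names below are assumptions for illustration, not the paper's exact formulation:

```python
def fit_power_model(samples):
    """Fit P(cores, freq) = p_static + k * cores * freq by ordinary least
    squares from (cores, freq_GHz, watts) measurements. The bilinear term
    is a common first-order approximation of dynamic power."""
    xs = [c * f for c, f, _ in samples]
    ys = [w for _, _, w in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    k = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) /
         sum((x - mean_x) ** 2 for x in xs))
    p_static = mean_y - k * mean_x          # intercept = static power
    return p_static, k

def predict_power(model, cores, freq):
    """Estimate power for an untested cores/frequency configuration."""
    p_static, k = model
    return p_static + k * cores * freq
```

With such a model, only a few well-chosen configurations need to be measured; the rest of the cores-frequency grid is interpolated.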

9.
We propose a novel parallel algorithm for generating all the sequences of the binary reflected Gray code for a given number of bits, targeting machines with multicore architectures. A theoretical analysis of work and span, as well as the parallelism of this algorithm, is carried out alongside a multithreaded implementation using Cilk++ on a multicore machine. The analysis shows a parallelism of Θ(2^n / log n), and the implementation achieves a linear speedup on 12 cores for input of sufficiently large size.
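Because the i-th binary reflected Gray code word has the closed form i XOR (i >> 1), every word can be computed independently, so disjoint index ranges can be generated on different cores with no communication. A minimal sketch (the chunk helper is a hypothetical stand-in for one worker's share):

```python
def gray_code(n):
    """All 2**n binary-reflected Gray code words, in order."""
    return [i ^ (i >> 1) for i in range(1 << n)]

def gray_chunk(n, start, stop):
    """The slice of the sequence a single worker would produce."""
    return [i ^ (i >> 1) for i in range(start, stop)]
```

Concatenating the workers' chunks in index order reproduces the full sequence, and consecutive words differ in exactly one bit.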

10.
Given the proliferation of layered, multicore- and SMT-based architectures, it is imperative to deploy and evaluate important, multi-level, scientific computing codes, such as meshing algorithms, on these systems. We focus on Parallel Constrained Delaunay Mesh (PCDM) generation. We exploit coarse-grain parallelism at the subdomain level, medium-grain parallelism at the cavity level, and fine-grain parallelism at the element level. This multi-grain data-parallel approach targets clusters built from commercially available SMTs and multicore processors. Exploiting the coarsest granularity facilitates scalability, in both execution time and problem size, on loosely coupled clusters, while exploiting medium-grain parallelism improves performance at the single-node level. Our experimental evaluation shows that the first generation of SMT cores is not capable of taking advantage of fine-grain parallelism in PCDM. Many of our experimental findings with PCDM extend to other adaptive and irregular multigrain parallel algorithms as well.

11.
The numerical solution of shallow water systems is useful for several applications related to geophysical flows, but the large dimensions of the domains suggest the use of powerful accelerators to obtain numerical results in reasonable times. This paper addresses how to speed up the numerical solution of a first-order well-balanced finite volume scheme for 2D one-layer shallow water systems by using modern Graphics Processing Units (GPUs) supporting the NVIDIA CUDA programming model. An algorithm which exploits the potential data parallelism of this method is presented and implemented using the CUDA model in single and double floating-point precision. Numerical experiments show the high efficiency of this CUDA solver in comparison with a parallel CPU implementation of the solver and with respect to a previously existing GPU solver based on a shading language.

12.
13.
14.
This paper describes two dynamic core allocation techniques for video decoding on homogeneous and heterogeneous embedded multicore platforms, with the objective of reducing energy consumption while guaranteeing performance. While decoding a frame, the scheme measures "slack" and "overshoot" relative to the budgeted decode time and amortizes them across neighboring frames, compensating for overshoot with slack time to maintain overall performance. It allocates, on a per-frame basis, an appropriate number and type of cores for decoding to guarantee performance, while saving energy by using clock gating to switch off unused cores. Using the Sniper simulator to evaluate the implementation of the scheme on a modern embedded processor, we obtain an energy saving of 6%–61% while strictly adhering to the required performance of 75 fps on homogeneous multicore architectures, and an energy saving of 2%–46% while meeting the performance target of 25 fps on heterogeneous multicore architectures. Thus, substantial energy savings can be achieved in video decoding by employing dynamic core allocation, compared with the default strategy of allocating as many cores as are available.
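The slack/overshoot bookkeeping can be sketched as a simple per-frame controller. The policy below is illustrative, not the paper's exact heuristic: decode time is assumed to scale inversely with core count, and the thresholds are arbitrary:

```python
def allocate_cores(decode_times_ms, budget_ms, max_cores=4):
    """Per-frame core allocation sketch. decode_times_ms[i] is the estimated
    single-core decode time of frame i. A running balance accumulates slack
    (positive) and overshoot (negative) against the per-frame budget; cores
    are added on overshoot and gated off when slack is ample."""
    cores, balance, plan = 1, 0.0, []
    for t in decode_times_ms:
        actual = t / cores                    # idealized scaling assumption
        balance += budget_ms - actual         # slack (+) or overshoot (-)
        if balance < 0 and cores < max_cores:
            cores += 1                        # behind budget: add a core
        elif balance > budget_ms and cores > 1:
            cores -= 1                        # a full frame of slack: drop one
        plan.append(cores)
    return plan
```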

15.
Asymmetric multicore processors (AMPs) have become popular in both high-end and low-end computing systems due to their flexibility and high performance. A performance-asymmetric multicore architecture (P-AMP) is a subcategory of AMP that integrates cores with different micro-architectures on the same chip. Owing to the heterogeneous nature of both cores and applications, identifying an optimal hardware configuration (core and voltage–frequency pair) for each application remains an NP-hard problem, and optimizing the energy-delay product (EDP) is an additional challenge on such architectures. To address these challenges, we developed a novel core prediction model, the lightweight deep neural network (LW-DNN), for asymmetric multicore processors. The proposed LW-DNN comprises three phases: feature selection, feature optimization, and core prediction. In the first two phases, workload characteristics are extracted and optimized using a pre-processing algorithm; in the third phase, the model predicts the appropriate cores for each workload at runtime to improve energy efficiency and performance. We modeled the deep neural network using the scikit-learn Python library and evaluated it on the ODROID XU3 ARM big.LITTLE performance-asymmetric multicore platform, using the MiBench, IoMT, and CoreMark embedded benchmark workloads. The LW-DNN prediction module is compared with traditional algorithms in terms of accuracy, execution time, energy consumption, and energy-delay product. The experimental results show core-prediction accuracy of up to 97%, with average reductions of 33% in energy consumption, 35% in energy-delay product, and 33% in execution time.

16.
Determining base points on elliptic curves is fundamental to implementing elliptic curve cryptosystems (ECC). This paper proposes a parallel base-point determination algorithm for ECC based on the OpenMP shared-memory model. Its parallel efficiency improves significantly on multicore processor platforms, reaching up to 110%. Experiments show that the parallelized base-point determination algorithm runs markedly faster than the sequential version; that parallel efficiency stabilizes as n (the bit length of the scalar) grows; that loop cache capacity has no effect on the efficiency gain; and that the algorithm resists side-channel attacks. The algorithm can therefore be used to speed up ECC base-point selection and, in turn, overall encryption/decryption.
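One cheap screening step for base-point candidates, checking curve membership, is independent per candidate and therefore easy to parallelize (e.g., one OpenMP thread per candidate in the paper's setting). A minimal sketch with illustrative toy parameters, not a full order/cofactor test:

```python
def on_curve(x, y, a, b, p):
    """Check whether (x, y) satisfies the short Weierstrass equation
    y^2 = x^3 + a*x + b (mod p) -- the first test any base-point
    candidate must pass."""
    return (y * y - (x * x * x + a * x + b)) % p == 0

def screen_candidates(candidates, a, b, p):
    """Keep only candidates lying on the curve; each check is independent,
    so this loop is trivially parallel."""
    return [pt for pt in candidates if on_curve(*pt, a, b, p)]
```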

17.
Servet is a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. These parameters can be used for autotuning codes to increase their performance on multicore clusters. Although Servet has been shown to accurately detect cache hierarchies, bandwidths and bottlenecks in memory accesses, and the communication overhead among cores, the impact of using this information on application performance optimization had not previously been assessed. This paper presents a novel algorithm that automatically uses Servet to map parallel applications onto multicore systems and analyzes its impact on three testbeds using three different parallel programming models: message-passing, shared memory, and partitioned global address space (PGAS). Our results show that a suitable mapping policy based on the data provided by this tool can significantly improve the performance of parallel applications without source code modification.

18.
The objective of this paper is to analyze the dynamic scheduling of dense linear algebra algorithms on shared‐memory, multicore architectures. Current numerical libraries (e.g., the linear algebra package LAPACK) show clear limitations on such emerging systems, mainly because of their coarse-granularity tasks. Thus, many numerical algorithms need to be redesigned to better fit the architectural design of the multicore platform. The parallel linear algebra for scalable multicore architectures (PLASMA) library developed at the University of Tennessee tackles this challenge by using tile algorithms to achieve a finer task granularity. These tile algorithms can then be represented by directed acyclic graphs, where nodes are the tasks and edges are the dependencies between the tasks. The key to achieving high performance is a runtime environment that efficiently schedules the execution of the directed acyclic graph across the multicore platform. This paper studies the impact on overall performance of several parameters, both at the level of the scheduler (e.g., window size and locality) and of the algorithms (e.g., left‐looking and right‐looking variants). We conclude that some commonly accepted rules for dense linear algebra algorithms may need to be revisited. Copyright © 2011 John Wiley & Sons, Ltd.
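The DAG-driven execution described above can be sketched as a greedy list scheduler: a task becomes ready once all its dependencies have finished, and ready tasks go to the least-loaded worker. This is an illustrative toy, not the library's actual runtime:

```python
from collections import deque

def schedule_dag(tasks, deps, workers):
    """List-schedule a task DAG. tasks maps task -> cost; deps maps
    task -> set of prerequisite tasks. Returns the makespan."""
    indeg = {t: len(deps.get(t, ())) for t in tasks}
    children = {t: [] for t in tasks}
    for t, ps in deps.items():
        for p in ps:
            children[p].append(t)
    finish = {}                       # completion time of each task
    loads = [0.0] * workers           # time each worker becomes free
    ready = deque(t for t in tasks if indeg[t] == 0)
    while ready:
        t = ready.popleft()
        w = min(range(workers), key=lambda i: loads[i])
        # A task starts when its worker is free and all deps have finished.
        start = max([loads[w]] + [finish[p] for p in deps.get(t, ())])
        finish[t] = start + tasks[t]
        loads[w] = finish[t]
        for c in children[t]:         # release tasks whose deps are done
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return max(finish.values())
```

Even this toy shows the point made in the abstract: with enough workers, independent tasks overlap and the sequential critical path is what remains.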

19.
The skyline operator determines points in a multidimensional dataset that offer some optimal trade-off. State-of-the-art CPU skyline algorithms exploit quad-tree partitioning with complex branching to minimise the number of point-to-point comparisons. Branch-phobic GPU skyline algorithms rely on compute throughput rather than partitioning, but fail to match the performance of sequential algorithms. In this paper, we introduce a new skyline algorithm, SkyAlign, that is designed for the GPU, and a GPU-friendly, grid-based tree structure upon which the algorithm relies. The search tree allows us to dramatically reduce the amount of work done by the GPU algorithm by avoiding most point-to-point comparisons, at the cost of some compute throughput. This trade-off allows SkyAlign to achieve orders-of-magnitude faster performance than its predecessors. Moreover, a NUMA-oblivious port of SkyAlign outperforms the native multicore state of the art on challenging workloads, by a margin that increases as more cores and sockets are utilised.

20.
This work studies how to adapt the number of threads of a parallel interval branch-and-bound algorithm to the available computational resources, based on the algorithm's current performance. Basically, a thread can create a new thread that will process part of its ancestor's workload; in this way, load balancing is inherent to thread creation. The applications in which we are interested use branch-and-bound algorithms, which are highly irregular and therefore difficult to predict, although the proposed methods can be used for more predictable algorithms as well. This research complements, rather than substitutes for, other mechanisms that improve the exploitation of the system, such as dynamic scheduling policies or work-stealing. The scenario for this research is a single multithreaded application running on a multicore architecture. Several approaches are presented; they differ in the metrics used and in whether or not the Operating System (OS) has to be modified. Experimental results show that the appropriate number of running threads can be determined at run time, avoiding having to establish the number of threads of an application statically. Thread-creation decisions have to be made frequently to obtain better results, but they are time-consuming; one of the presented models uses the availability of an idle processor to trigger these decisions, obtaining the desired results.


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号